Implementing OCR in search

Posted by Chaithragm under C# on 3/27/2013 | Points: 10 | Views : 3175 | Status : [Member] | Replies : 2

I need to search the content of PDF documents, and for that I am using OCR in my application.
The PDF documents are saved in the directory D:/Books.
I have used the code below, but it is not working.

// Needs: using System; using System.IO; using System.Windows.Forms;
// plus a COM reference to Microsoft Office Document Imaging (MODI).
public void CheckFileType(string directoryPath)
{
    foreach (string filePath in Directory.GetFiles(directoryPath))
    {
        // Check for the PDF file format (case-insensitive)
        string fileExtension = Path.GetExtension(filePath);
        if (!string.Equals(fileExtension, ".pdf", StringComparison.OrdinalIgnoreCase))
            continue;

        // Text file with the same name as the source file
        string textPath = Path.ChangeExtension(filePath, ".txt");

        try
        {
            // OCR operations.
            // Assumption: MODI is designed to open TIFF/MDI images, so Create()
            // will probably fail on a PDF unless its pages are converted to images first.
            MODI.Document md = new MODI.Document();
            md.Create(filePath);
            md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);

            // Only the first page is read here
            MODI.Image image = (MODI.Image)md.Images[0];

            // Save the recognized text, overwriting any earlier result
            File.WriteAllText(textPath, image.Layout.Text);
            md.Close(false);
        }
        catch (Exception e)
        {
            // Show the error instead of silently swallowing it
            MessageBox.Show(e.ToString());
        }
    }
}

Responses

Posted by: Arronlee on: 8/8/2013 [Member] Starter | Points: 25

As is well known, there are two basic types of core OCR algorithm, either of which may produce a ranked list of candidate characters.
Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching" or "pattern recognition". This relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. This technique works best with typewritten text and does not work well when new fonts are encountered. This is the technique the early physical photocell-based OCR implemented, rather directly.
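To make matrix matching concrete, here is a minimal C# sketch, assuming each glyph has already been isolated and scaled to the same size as a black-and-white pixel grid. The class and method names (MatrixMatcher, Score, BestMatch) are made up for illustration and do not come from MODI or any other OCR library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of any OCR library.
public static class MatrixMatcher
{
    // Fraction of pixels (0.0 to 1.0) on which two equally sized binary glyphs agree.
    public static double Score(bool[,] input, bool[,] stored)
    {
        if (input.GetLength(0) != stored.GetLength(0) ||
            input.GetLength(1) != stored.GetLength(1))
            throw new ArgumentException("Glyphs must have the same dimensions.");

        int matches = 0;
        for (int y = 0; y < input.GetLength(0); y++)
            for (int x = 0; x < input.GetLength(1); x++)
                if (input[y, x] == stored[y, x])
                    matches++;

        // Length is the total number of pixels in a 2D array.
        return (double)matches / input.Length;
    }

    // Index of the stored glyph that agrees with the input on the most pixels.
    public static int BestMatch(bool[,] input, IList<bool[,]> templates)
    {
        if (templates.Count == 0)
            throw new ArgumentException("At least one template is required.");

        int best = 0;
        for (int i = 1; i < templates.Count; i++)
            if (Score(input, templates[i]) > Score(input, templates[best]))
                best = i;
        return best;
    }
}
```

Because every pixel is compared position by position, a glyph in a different font, scale or offset scores poorly, which is the weakness described above.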
Feature extraction decomposes glyphs into "features" like lines, closed loops, line direction, and line intersections. These are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in "intelligent" handwriting recognition and indeed most modern OCR software (see, for example, http://www.yiigo.com/net-document-image-plugin/ocr-plugin/). Nearest neighbour classifiers such as the k-nearest neighbors algorithm are used to compare image features with stored glyph features and choose the nearest match.
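The nearest-neighbour step can be sketched roughly as follows in C#. This assumes the features have already been extracted into a numeric vector per glyph (for example, counts of closed loops and line endings); the feature layout and the class name NearestNeighbourClassifier are my own assumptions, not taken from any particular OCR engine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical k-nearest-neighbours classifier over glyph feature vectors.
public static class NearestNeighbourClassifier
{
    // Returns the label that occurs most often among the k stored samples
    // closest to the input features. Ties go to the label whose sample is nearest.
    public static char Classify(
        double[] features,
        IList<KeyValuePair<char, double[]>> samples,
        int k)
    {
        if (samples.Count == 0 || k < 1)
            throw new ArgumentException("Need at least one sample and k >= 1.");

        return samples
            .OrderBy(s => Distance(features, s.Value))   // nearest first
            .Take(k)                                     // keep the k nearest
            .GroupBy(s => s.Key)                         // vote per label
            .OrderByDescending(g => g.Count())
            .First()
            .Key;
    }

    // Euclidean distance between two feature vectors of equal length.
    private static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(sum);
    }
}
```

With k = 1 this simply picks the single closest stored glyph; a larger k makes the choice less sensitive to one noisy sample.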
Software such as Cuneiform and Tesseract uses a two-pass approach to character recognition. The second pass is known as "adaptive recognition" and uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters on the second pass. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded).
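The two-pass idea can be illustrated with a heavily simplified C# sketch. It is not how Cuneiform or Tesseract actually implement adaptive recognition; it only assumes that glyphs are feature vectors, that "high confidence" means the distance to the nearest prototype is below a threshold, and that confident glyphs are reused as extra prototypes. All names here (AdaptiveRecognizerSketch, confidentDistance) are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Simplified illustration of two-pass (adaptive) recognition.
public static class AdaptiveRecognizerSketch
{
    public static char[] Recognize(
        IList<double[]> glyphs,
        IList<KeyValuePair<char, double[]>> prototypes,
        double confidentDistance)
    {
        if (prototypes.Count == 0)
            throw new ArgumentException("At least one prototype is required.");

        var result = new char[glyphs.Count];
        var adapted = new List<KeyValuePair<char, double[]>>(prototypes);
        var uncertain = new List<int>();

        // Pass 1: classify against the stock prototypes only.
        for (int i = 0; i < glyphs.Count; i++)
        {
            KeyValuePair<char, double[]> best = Nearest(glyphs[i], prototypes);
            result[i] = best.Key;
            if (Distance(glyphs[i], best.Value) <= confidentDistance)
                // Confident match: remember this document's own letter shape.
                adapted.Add(new KeyValuePair<char, double[]>(best.Key, glyphs[i]));
            else
                uncertain.Add(i);
        }

        // Pass 2: retry the uncertain glyphs with the adapted prototype set.
        foreach (int i in uncertain)
            result[i] = Nearest(glyphs[i], adapted).Key;

        return result;
    }

    private static KeyValuePair<char, double[]> Nearest(
        double[] glyph, IList<KeyValuePair<char, double[]>> candidates)
    {
        KeyValuePair<char, double[]> best = candidates[0];
        foreach (KeyValuePair<char, double[]> c in candidates)
            if (Distance(glyph, c.Value) < Distance(glyph, best.Value))
                best = c;
        return best;
    }

    private static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(sum);
    }
}
```

A distorted letter that sits far from every stock prototype can still land close to a confidently recognized letter from the same scan, so the second pass may change the first-pass guess.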
So which kind of OCR are you developing?

Posted by: Barcodelib on: 2/12/2014 [Member] Starter | Points: 25

I wonder whether this OCR SDK lets developers recognize colored text accurately: http://www.rasteredge.com/how-to/vb-net-imaging/ocr-sdk/
