OCR on Linux

When we scan a page of a book, what gets stored on the computer is an image, like a photograph of that page, usually in .jpg format. This means we cannot perform basic text-processing operations on it, such as deleting characters, copying, pasting or indenting, simply because it is an image. If we open it with a word processor, it has to be one that supports images, and it still will not support any of the operations above. To transform that image into a text file (from .jpg to .txt, .rtf, .doc, or whatever) we need optical character recognition (OCR): the process by which a program reads the image and recognizes that this little circle is an "o", that stick is an "I", and so on.

On Windows there is a great program, ABBYY FineReader, which lets you do this simply and automatically. On Linux, as always, we have several options. The one I am going to explain is done from the console and requires a bit of work, but it also gives us a good deal of control over what we do.

 Necessary ingredients:

 -tesseract-ocr
 -tesseract-ocr-spa
 -imagemagick

 These typically come installed in Ubuntu, but if they do not, you know what to do: System, Administration, Synaptic Package Manager (or whichever way you prefer to install packages).
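
 Before going any further, it can help to check from the console whether the two commands used in this post are already available. The following is only a minimal sketch: missing_tools is a hypothetical helper name invented for this example, and the apt-get line assumes an Ubuntu or Debian system with the package names listed above.

```shell
#!/bin/sh
# missing_tools is a hypothetical helper (not part of any package):
# it prints the name of every given command that is NOT found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# tesseract comes from tesseract-ocr, convert comes from imagemagick.
missing=$(missing_tools tesseract convert)

if [ -n "$missing" ]; then
    echo "Missing commands:" $missing
    # Assumption: a Debian/Ubuntu system where apt-get is available.
    echo "Try: sudo apt-get install tesseract-ocr tesseract-ocr-spa imagemagick"
else
    echo "tesseract and convert are both available."
fi
```

 The script only reports what is missing; it does not install anything by itself.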

 Tesseract

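 The program in question is called tesseract and has a simple syntax:

 tesseract elcastillo.tif elcastillo -l spa

 This produces a text file, elcastillo.txt. The first argument is the input image, the second is the output name without extension, and -l spa selects the Spanish language data.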
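
 If tesseract complains about the language, the Spanish data (tesseract-ocr-spa) is probably not installed. As a rough sketch, the check below relies on the --list-langs option of recent tesseract versions, which prints one language code per line; has_lang is a hypothetical helper name, and older versions may behave differently.

```shell
#!/bin/sh
# has_lang is a hypothetical helper: it succeeds (exit status 0) when the
# given language code appears in the output of "tesseract --list-langs".
# Assumption: the installed tesseract supports --list-langs.
has_lang() {
    # 2>&1 because some versions print the list on stderr.
    # grep -x matches whole lines only, so "spa" does not match "spa_old".
    tesseract --list-langs 2>&1 | grep -qx "$1"
}

# Example of use (uncomment to run against a real installation):
# if has_lang spa; then
#     tesseract elcastillo.tif elcastillo -l spa
# else
#     echo "Spanish data not found; install tesseract-ocr-spa"
# fi
```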

 As you can see, the image file is not a .jpg but a .tif, so if we have a .jpg we must convert it. For this we will use a program called convert, which is part of the ImageMagick image-manipulation package. In this case, the syntax for conversion is:

 convert elcastillo.jpg elcastillo.tif

 If we have a whole scanned book, it is better to put all the .jpg files in one folder and run this loop, which goes through the entire folder:

 for k in *.jpg; do convert "$k" "$k.tif"; done

 This applies the convert command to every .jpg image in the folder and produces a .tif for each one (elcastillo001.jpg becomes elcastillo001.jpg.tif).

 Once we have the .tif files, we run the following:

 for i in *.tif; do tesseract "$i" "$i" -l spa; done

 which means: run the OCR (tesseract) over every .tif in that folder.

 In this way we end up with individual files such as elcastillo001.jpg.tif.txt (which are text files).
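
 The two loops can also be combined into a single function. What follows is a sketch under some assumptions, not the only way to do it: ocr_folder is a hypothetical name, the language defaults to spa, and unlike the loops above it strips the .jpg suffix so the results are named elcastillo001.tif and elcastillo001.txt.

```shell
#!/bin/sh
# ocr_folder DIR [LANG]
# Hypothetical helper: converts every .jpg in DIR to .tif with convert,
# then runs tesseract on each .tif. LANG defaults to "spa".
ocr_folder() {
    dir=$1
    lang=${2:-spa}
    for jpg in "$dir"/*.jpg; do
        # If there are no .jpg files the pattern stays literal; skip it.
        [ -e "$jpg" ] || continue
        base=${jpg%.jpg}                      # page.jpg -> page
        convert "$jpg" "$base.tif" || return 1
        # The second argument is the output name; tesseract adds ".txt".
        tesseract "$base.tif" "$base" -l "$lang" || return 1
    done
}

# Example of use (the folder name is only an illustration):
# ocr_folder ./elcastillo spa
```

 Quoting the variables keeps the function working when file names contain spaces, and stopping at the first failure makes it easier to see which page caused the problem.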

 As you can see, it is not as simple as other things, but along the way you learn to do several things.