Tesseract is an OCR engine originally developed by HP. It is now an open-source project sponsored by Google. The software takes an image and converts it into text, and it supports a wide range of languages and fonts. Tesseract is a fairly advanced engine. Unlike some of the available cloud-based OCR services, it can, for example, report the location of each word found on a page. This is important if you want to parse the recognized text.
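To illustrate why word locations matter for parsing, here is a minimal sketch that groups recognized words into lines of text using only their bounding boxes. The WordBox class and the toLines method are hypothetical helpers written for this example, not part of Tesseract or Tess4J, and the "same line" rule (a word's vertical centre falls inside the box of the line's first word) is an assumption that suits simple, unrotated pages.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class WordLayout {

    /** Hypothetical holder: one recognized word plus its bounding box in pixels. */
    public static final class WordBox {
        final String text;
        final Rectangle box;

        public WordBox(String text, Rectangle box) {
            this.text = text;
            this.box = box;
        }
    }

    /** Groups words into lines of text, top to bottom and left to right. */
    public static List<String> toLines(List<WordBox> words) {
        // Sort by top edge first so that lines are visited from top to bottom.
        List<WordBox> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingInt((WordBox w) -> w.box.y)
                .thenComparingInt(w -> w.box.x));

        List<String> lines = new ArrayList<>();
        List<WordBox> current = new ArrayList<>();
        Rectangle reference = null; // box of the first word on the current line

        for (WordBox word : sorted) {
            double centreY = word.box.getCenterY();
            boolean sameLine = reference != null
                    && centreY >= reference.y
                    && centreY <= reference.y + reference.height;
            if (!sameLine) {
                // Close the previous line (if any) and start a new one.
                if (!current.isEmpty()) {
                    lines.add(join(current));
                    current.clear();
                }
                reference = word.box;
            }
            current.add(word);
        }
        if (!current.isEmpty()) {
            lines.add(join(current));
        }
        return lines;
    }

    /** Joins the words of one line in left-to-right order. */
    private static String join(List<WordBox> line) {
        List<WordBox> ordered = new ArrayList<>(line);
        ordered.sort(Comparator.comparingInt((WordBox w) -> w.box.x));
        List<String> texts = new ArrayList<>();
        for (WordBox w : ordered) {
            texts.add(w.text);
        }
        return String.join(" ", texts);
    }
}
```

With boxes like these, a label and its value that sit on the same line (for example "Total:" and "42") stay together even when the OCR engine returns the words in a different order.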
The engine is written in C++, which makes it somewhat hard to use from Java. Fortunately, there is a Java wrapper available named Tess4J. In addition to TIFF images, Tess4J can also process PDF documents.
package tess4jTest;

import java.awt.Rectangle;
import java.io.File;
import net.sourceforge.tess4j.*;

public class Testtess {
    public static void main(String[] args) {
        // Image to recognize, expected in the working directory
        File imageFile = new File(System.getProperty("user.dir") + "/test.png");
        Tesseract tessInst = new Tesseract();
        // Directory that holds the *.traineddata language files
        tessInst.setDatapath(System.getProperty("user.dir") + "/tessdata");
        tessInst.setLanguage("eng"); // eng.traineddata is in the tessdata directory
        // Restrict recognition to this region (x, y, width, height in pixels)
        Rectangle rectangle = new Rectangle(0, 0, 763, 534);
        try {
            String result = tessInst.doOCR(imageFile, rectangle);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
FAQ
1. ERROR net.sourceforge.tess4j.Tesseract - Not a JPEG file: starts with 0x89 0x50
Solution: the file is not actually a JPEG file (0x89 0x50 is how a PNG file starts). Use a real JPEG file.
2. WARN Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Solution: option A, set the path in code: tessInst.setDatapath(System.getProperty("user.dir") + "/tessdata");
option B, set TESSDATA_PREFIX in your environment. Tesseract uses it as the default location of tessdata. If it is not set, Tesseract tries to open ./*.traineddata files.
3. "Warning: Parameter not found: enable_new_segsearch"
Solution: Works with this eng.traineddata: https://github.com/tesseract-ocr/tessdata_fast/blob/master/eng.traineddata
Note: for language data, the files from tessdata_best are the best choice. If you want to recognize Chinese, download chi_sim.traineddata and move it into your tessdata directory.
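For the first FAQ item, a file's real type can be checked from its first bytes before it is handed to the OCR engine. The sketch below is a hypothetical helper (the ImageTypeSniffer class and its method names are made up for this example) and it only knows the PNG, JPEG and TIFF signatures; anything else is reported as "unknown".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ImageTypeSniffer {

    /** Returns "png", "jpeg", "tiff" or "unknown" based on the leading bytes. */
    public static String detectFromHeader(byte[] h) {
        // PNG signature starts with 0x89 'P' 'N' 'G'
        if (startsWith(h, 0x89, 0x50, 0x4E, 0x47)) {
            return "png";
        }
        // JPEG files start with 0xFF 0xD8 0xFF
        if (startsWith(h, 0xFF, 0xD8, 0xFF)) {
            return "jpeg";
        }
        // TIFF: "II*\0" (little-endian) or "MM\0*" (big-endian)
        if (startsWith(h, 0x49, 0x49, 0x2A, 0x00) || startsWith(h, 0x4D, 0x4D, 0x00, 0x2A)) {
            return "tiff";
        }
        return "unknown";
    }

    /** Reads the first four bytes of a file and classifies them. */
    public static String detectFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[4];
            int read = in.readNBytes(header, 0, header.length); // Java 9+
            return detectFromHeader(Arrays.copyOf(header, read));
        }
    }

    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A file named test.jpg for which detectFile returns "png" is the situation behind the "Not a JPEG file: starts with 0x89 0x50" error.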
Java's print API basically works on the assumption that everything is done at 72 dpi. This means that you can use 72 units per inch as the basis for converting to and from other measurements.
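The conversions that follow from the 72 dpi assumption can be sketched as below. The DpiConverter class and its method names are hypothetical and exist only for this example; the only facts used are 72 units per inch and 25.4 mm per inch.

```java
public class DpiConverter {

    public static final double UNITS_PER_INCH = 72.0; // Java print API assumption
    public static final double MM_PER_INCH = 25.4;

    /** Converts a length in 1/72 inch units to inches. */
    public static double unitsToInches(double units) {
        return units / UNITS_PER_INCH;
    }

    /** Converts a length in 1/72 inch units to millimetres. */
    public static double unitsToMillimetres(double units) {
        return unitsToInches(units) * MM_PER_INCH;
    }

    /** Converts millimetres to 1/72 inch units. */
    public static double millimetresToUnits(double mm) {
        return mm / MM_PER_INCH * UNITS_PER_INCH;
    }

    /** Converts 1/72 inch units to pixels at the given resolution, rounded to the nearest pixel. */
    public static int unitsToPixels(double units, int dpi) {
        return (int) Math.round(unitsToInches(units) * dpi);
    }
}
```

For example, a length of 72 units is one inch, so it corresponds to 300 pixels in an image rendered at 300 dpi.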
references:
1. http://www.jbrandsma.com/news/2015/12/07/ocr-with-java-and-tesseract/
2. https://sourceforge.net/projects/tess4j/
3. https://github.com/tesseract-ocr/tessdata_best
4. https://www.b4x.com/android/forum/threads/solved-tesseract-api-a-120-opotunity.101482/
5. https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/