Tess4J Usage Example and Common Errors (net.sourceforge.tess4j.Tesseract)



Tesseract is an OCR engine originally developed by HP. It is now an open-source project sponsored by Google. The software takes an image and converts it into text, and it supports a wide range of languages and fonts. Tesseract is a fairly advanced engine: unlike some cloud-based OCR services, it can, for example, report the location of each word found on a page. This matters if you want to parse the recognized text.

The engine is written in C++, which makes it somewhat hard to use from Java. Fortunately, there is a Java wrapper called Tess4J. In addition to TIFF images, Tess4J can also scan PDF documents.

package tess4jTest;

import java.awt.Rectangle;
import java.io.File;
import net.sourceforge.tess4j.*;

public class Testtess {

    public static void main(String[] args) {
        // Image to recognize, expected in the working directory
        File imageFile = new File(System.getProperty("user.dir") + "/test.png");
        Tesseract tessInst = new Tesseract();
        // Directory that holds the *.traineddata language files
        tessInst.setDatapath(System.getProperty("user.dir") + "/tessdata");
        tessInst.setLanguage("eng"); // eng.traineddata is in the tessdata directory

        // Restrict recognition to a region of the image (x, y, width, height)
        Rectangle rectangle = new Rectangle(0, 0, 763, 534);
        try {
            String result = tessInst.doOCR(imageFile, rectangle);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

FAQ

1. ERROR net.sourceforge.tess4j.Tesseract - Not a JPEG file: starts with 0x89 0x50

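Solution: the file is not actually a JPEG file. The bytes 0x89 0x50 are the start of the PNG signature, so the file is most likely a PNG saved with a .jpg extension. Use a genuine JPEG file, or give the file an extension that matches its real format.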
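To confirm what a file really is before handing it to Tess4J, you can look at its first bytes. The following is a minimal sketch using only the JDK; the class name ImageTypeSniffer and its sniff methods are hypothetical helpers for illustration, not part of Tess4J. It recognizes only the PNG signature (0x89 'P' 'N' 'G') and the JPEG start marker (0xFF 0xD8 0xFF).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not part of Tess4J): guesses the image type from its leading bytes.
class ImageTypeSniffer {

    /** Returns "png", "jpeg", or "unknown" based on the leading bytes. */
    static String sniff(byte[] header) {
        if (header.length >= 4
                && (header[0] & 0xFF) == 0x89 && header[1] == 'P'
                && header[2] == 'N' && header[3] == 'G') {
            return "png";
        }
        if (header.length >= 3
                && (header[0] & 0xFF) == 0xFF
                && (header[1] & 0xFF) == 0xD8
                && (header[2] & 0xFF) == 0xFF) {
            return "jpeg";
        }
        return "unknown";
    }

    /** Reads up to four bytes from the file and classifies them. */
    static String sniff(Path file) throws IOException {
        byte[] header = new byte[4];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break; // file is shorter than four bytes
                }
                total += n;
            }
        }
        return sniff(Arrays.copyOf(header, total));
    }

    public static void main(String[] args) throws IOException {
        // Pass the image path as the first argument; "test.jpg" is only a placeholder.
        Path file = Paths.get(args.length > 0 ? args[0] : "test.jpg");
        System.out.println(file + " looks like: " + sniff(file));
    }
}
```

If this reports "png" for a file named *.jpg, the extension is wrong, which matches the error message above.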

2. WARN Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.

Solution: option A, set the data path in code: tessInst.setDatapath(System.getProperty("user.dir") + "/tessdata");

Option B, set the TESSDATA_PREFIX environment variable. Tesseract uses its value as the default tessdata location. If it is not set, Tesseract tries to open ./*.traineddata in the current directory.
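The two options can be combined: use TESSDATA_PREFIX when it is set, and otherwise fall back to a tessdata folder under the working directory. Below is a minimal sketch using only the JDK. The class TessdataLocator and its resolve method are hypothetical names, and the sketch assumes TESSDATA_PREFIX points directly at the tessdata directory, as the warning message says.

```java
import java.io.File;

// Hypothetical helper (not part of Tess4J): picks the tessdata directory to use.
class TessdataLocator {

    /**
     * Returns envValue if it is non-blank, otherwise "<workingDir>/tessdata".
     * Assumption: the environment variable points at the tessdata directory itself.
     */
    static String resolve(String envValue, String workingDir) {
        if (envValue != null && !envValue.trim().isEmpty()) {
            return envValue.trim();
        }
        return new File(workingDir, "tessdata").getPath();
    }

    public static void main(String[] args) {
        String dataPath = resolve(System.getenv("TESSDATA_PREFIX"),
                                  System.getProperty("user.dir"));
        File engData = new File(dataPath, "eng.traineddata");
        System.out.println("tessdata directory: " + dataPath);
        System.out.println("eng.traineddata present: " + engData.isFile());
        // The resolved path would then be passed to tessInst.setDatapath(dataPath).
    }
}
```

Checking that eng.traineddata exists before calling doOCR makes a wrong data path easier to spot than the warning alone.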

3. "Warning: Parameter not found: enable_new_segsearch"

Solution: Works with this eng.traineddata: https://github.com/tesseract-ocr/tessdata_fast/blob/master/eng.traineddata

Note: for language data, the files from tessdata_best are the best choice. To recognize Chinese, download chi_sim.traineddata and place it in your tessdata directory.

Java's print API basically works on the assumption that everything is done at 72 dpi. This means you can use 72 dpi as the basis for converting to and from other measurements, for example when working out pixel sizes for a 300 dpi image.
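The conversion itself is simple arithmetic: a length in 72 dpi units divided by 72 gives inches, and inches multiplied by the target dpi gives pixels. Here is a minimal sketch; the class DpiConverter and its method names are hypothetical, chosen only for illustration.

```java
// Hypothetical helper: converts between 72 dpi units (points) and pixels at a given dpi.
class DpiConverter {

    static final double POINTS_PER_INCH = 72.0;

    /** Converts a length in 72 dpi units to pixels at the given dpi, rounded to the nearest pixel. */
    static int pointsToPixels(double points, int dpi) {
        return (int) Math.round(points / POINTS_PER_INCH * dpi);
    }

    /** Converts a pixel length at the given dpi back to 72 dpi units. */
    static double pixelsToPoints(int pixels, int dpi) {
        return pixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        // 612 units at 72 dpi is 8.5 inches, i.e. 2550 pixels at 300 dpi.
        System.out.println(pointsToPixels(612, 300));
        // 300 pixels at 300 dpi is one inch, i.e. 72 units.
        System.out.println(pixelsToPoints(300, 300));
    }
}
```

The same arithmetic can be used to scale a Rectangle passed to doOCR when the image resolution differs from the resolution the coordinates were measured at.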

References:

1. http://www.jbrandsma.com/news/2015/12/07/ocr-with-java-and-tesseract/

2. https://sourceforge.net/projects/tess4j/

3. https://github.com/tesseract-ocr/tessdata_best

4. https://www.b4x.com/android/forum/threads/solved-tesseract-api-a-120-opotunity.101482/

5. https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/

6. https://stackoverflow.com/questions/18975595/how-to-design-an-image-in-java-to-be-printed-on-a-300-dpi-printer