Tess4J Usage Example

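Tesseract is an OCR engine originally developed by HP and now sponsored by Google; it supports many languages and fonts. Tess4J is a Java wrapper for Tesseract that makes OCR available from Java, including PDF scanning and text location. This article covers problems you may run into when using Tess4J and how to solve them, such as wrong file types, environment variable setup, and missing parameters, and lists related resources.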

Tesseract is an OCR engine originally developed by HP. It is now an open-source project sponsored by Google. The software takes an image and converts it into text, and it supports a wide range of languages and fonts. Tesseract is a fairly advanced engine: unlike some cloud-based OCR services, it can report the location of each word found on a page. This matters if you want to parse the recognized text.

The engine is written in C++, which makes it somewhat hard to use from Java. Fortunately, a Java wrapper named Tess4J is available. In addition to TIFF images, Tess4J can also scan PDF documents.

package tess4jTest;

import java.awt.Rectangle;
import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class Testtess {

    public static void main(String[] args) {
        // Image to recognize, expected in the working directory
        File imageFile = new File(System.getProperty("user.dir") + "/test.png");

        Tesseract tessInst = new Tesseract();
        // Directory that holds the *.traineddata language files
        tessInst.setDatapath(System.getProperty("user.dir") + "/tessdata");
        tessInst.setLanguage("eng"); // eng.traineddata is in the tessdata directory

        // Recognize only the text inside this region (x, y, width, height in pixels)
        Rectangle rectangle = new Rectangle(0, 0, 763, 534);
        try {
            String result = tessInst.doOCR(imageFile, rectangle);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }

}
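
The region passed to doOCR above is hard-coded to 763×534 pixels, which only fits an image at least that large. As a minimal sketch, the helper below trims a requested region to the image bounds before it is handed to the OCR call. OcrRegion and clampToImage are hypothetical names introduced for illustration, not part of Tess4J, and whether your Tess4J version actually needs this clamping is an assumption worth checking.

```java
import java.awt.Rectangle;

// Hypothetical helper for illustration; not part of Tess4J.
class OcrRegion {

    /** Returns the part of the requested region that lies inside an image of the given size. */
    static Rectangle clampToImage(Rectangle requested, int imageWidth, int imageHeight) {
        Rectangle imageBounds = new Rectangle(0, 0, imageWidth, imageHeight);
        // intersection() yields an empty rectangle when the two do not overlap
        Rectangle clamped = requested.intersection(imageBounds);
        if (clamped.isEmpty()) {
            throw new IllegalArgumentException("Region " + requested + " lies outside the image");
        }
        return clamped;
    }

    public static void main(String[] args) {
        // The 640x480 image size is an assumed value for the demonstration
        Rectangle region = clampToImage(new Rectangle(0, 0, 763, 534), 640, 480);
        System.out.println(region);
    }
}
```

In real code the width and height would come from the loaded image, for example from a BufferedImage read with ImageIO.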

FAQ

1. ERROR net.sourceforge.tess4j.Tesseract - Not a JPEG file: starts with 0x89 0x50

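Solution: the file is not actually a JPEG. The bytes 0x89 0x50 are the start of the PNG signature, so this is a PNG file with a JPEG extension. Rename it with the correct extension, or supply a real JPEG file.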
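
To catch this before calling Tess4J, you can look at the first bytes of the file instead of trusting its extension. The sketch below checks only the PNG signature (0x89 0x50 0x4E 0x47) and the JPEG start marker (0xFF 0xD8 0xFF). ImageSniffer and detectFormat are hypothetical names, not part of Tess4J, and the code assumes Java 11 or later for InputStream.readNBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Tess4J.
class ImageSniffer {

    /** Guesses the format from the leading bytes: "png", "jpeg" or "unknown". */
    static String detectFormat(byte[] header) {
        if (header.length >= 4
                && (header[0] & 0xFF) == 0x89 && header[1] == 0x50
                && header[2] == 0x4E && header[3] == 0x47) {
            return "png";
        }
        if (header.length >= 3
                && (header[0] & 0xFF) == 0xFF && (header[1] & 0xFF) == 0xD8
                && (header[2] & 0xFF) == 0xFF) {
            return "jpeg";
        }
        return "unknown";
    }

    /** Reads the first four bytes of the file and guesses its format. */
    static String detectFormat(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return detectFormat(in.readNBytes(4));
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "test.png");
        if (Files.isRegularFile(file)) {
            System.out.println(file + " looks like: " + detectFormat(file));
        } else {
            System.out.println("No such file: " + file);
        }
    }
}
```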

2. WARN Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.

Solution: option A, call tessInst.setDatapath(System.getProperty("user.dir") + "/tessdata"); in your code.

Option B, set the TESSDATA_PREFIX environment variable, which Tesseract uses as the default tessdata location. If it is not set, Tesseract tries to open ./*.traineddata in the current directory.
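
The two options can be combined: use TESSDATA_PREFIX when it is set and fall back to a tessdata folder under the working directory otherwise. The following is a rough sketch of that idea. TessdataLocator and resolve are hypothetical names, and the code assumes the variable points at the tessdata directory itself, as the warning text says; check how your Tesseract version interprets it.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of Tess4J.
class TessdataLocator {

    /**
     * Picks the tessdata directory.
     * envValue is the value of TESSDATA_PREFIX (may be null);
     * workingDir is the base directory used for the fallback.
     */
    static File resolve(String envValue, String workingDir) {
        if (envValue != null && !envValue.trim().isEmpty()) {
            return new File(envValue.trim());
        }
        return new File(workingDir, "tessdata");
    }

    public static void main(String[] args) {
        File dir = resolve(System.getenv("TESSDATA_PREFIX"), System.getProperty("user.dir"));
        System.out.println("tessdata directory: " + dir);
        System.out.println("exists: " + dir.isDirectory());
        // The result could then be passed on, e.g. tessInst.setDatapath(dir.getPath());
    }
}
```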

3. "Warning: Parameter not found: enable_new_segsearch" 

Solution: replace eng.traineddata. The warning does not appear with this file: https://github.com/tesseract-ocr/tessdata_fast/blob/master/eng.traineddata

 

Note: for language data, prefer the files from tessdata_best. To recognize Simplified Chinese, download chi_sim.traineddata and place it in your tessdata directory.
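
A missing language file is easy to overlook, so it can help to check the tessdata directory before starting OCR. The sketch below takes a language string in Tesseract's "+" notation (for example "eng+chi_sim") and reports which .traineddata files are absent. LanguageFiles and missing are hypothetical names, not part of Tess4J.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tess4J.
class LanguageFiles {

    /** Returns the languages whose <lang>.traineddata file is not in tessdataDir. */
    static List<String> missing(File tessdataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String lang : languages.split("\\+")) {
            File dataFile = new File(tessdataDir, lang + ".traineddata");
            if (!dataFile.isFile()) {
                missing.add(lang);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        File tessdataDir = new File(System.getProperty("user.dir"), "tessdata");
        List<String> missing = missing(tessdataDir, "eng+chi_sim");
        if (missing.isEmpty()) {
            System.out.println("All language files found in " + tessdataDir);
        } else {
            System.out.println("Missing language files: " + missing);
        }
    }
}
```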

Java's print API basically works on the assumption that everything is done at 72 dpi. This means you can use 72 dpi as the basis for converting to and from other units of measurement.
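
As an illustration of that conversion, the sketch below treats one print unit as 1/72 inch and converts between print units, pixels at a given resolution, and millimetres. DpiConverter and its method names are hypothetical, and the 300 dpi figure in main is only an example value.

```java
// Hypothetical helper for illustration; the names are not from any library.
class DpiConverter {

    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    /** Converts a length in 1/72 inch units to pixels at the given resolution. */
    static double pointsToPixels(double points, double dpi) {
        return points / POINTS_PER_INCH * dpi;
    }

    /** Converts a pixel length at the given resolution to 1/72 inch units. */
    static double pixelsToPoints(double pixels, double dpi) {
        return pixels / dpi * POINTS_PER_INCH;
    }

    /** Converts millimetres to 1/72 inch units. */
    static double millimetresToPoints(double millimetres) {
        return millimetres / MM_PER_INCH * POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // How wide is a 763 pixel image, scanned at an assumed 300 dpi, in print units?
        System.out.println(pixelsToPoints(763, 300));
        // How many pixels does one inch (72 units) cover at 300 dpi?
        System.out.println(pointsToPixels(72, 300));
    }
}
```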

References:

1. http://www.jbrandsma.com/news/2015/12/07/ocr-with-java-and-tesseract/

2. https://sourceforge.net/projects/tess4j/

3. https://github.com/tesseract-ocr/tessdata_best

4. https://www.b4x.com/android/forum/threads/solved-tesseract-api-a-120-opotunity.101482/

5. https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/

6. https://stackoverflow.com/questions/18975595/how-to-design-an-image-in-java-to-be-printed-on-a-300-dpi-printer
