Tess4J Use Example

使用Tess4J进行OCR操作指南
Tesseract是一款由HP开发、现由Google赞助的OCR引擎,支持多种语言和字体。Tess4J是Tesseract的Java封装库,允许从Java调用OCR功能,包括PDF扫描和文字定位。本文介绍了在使用Tess4J过程中可能遇到的问题及其解决方案,如文件类型错误、环境变量设置、参数缺失等,并提供了相关资源链接。

 Tesseract is ocr engine once developed by HP. Currently it is an opensource project sponsored by Google. The software is capable of taking a picture and transforming it into text. It supports a wide range of languages and fonts. Tesseract is a rather advanced engine. Unlike some of the available cloud based OCR services, it for example provides the option to get information on location of each word found on a page. This is important if you want to parse the fetched text.

The engine is written in C++. This makes it somewhat hard to use it from Java. Fortunately there is Java ‘wrapper’ available named Tess4J. Tess4J also provides the option to scan pdf documents next to tiffs. 

package tess4jTest;

import java.io.File;
import net.sourceforge.tess4j.*;

public class Testtess {

    public static void main(String[] args) {
        File imageFile = new File(System.getProperty("user.dir") + "/test.png");
        Tesseract tessInst = new Tesseract();
        tessInst.setDatapath(System.getProperty("user.dir") + "/tessdata");
        tessInst.setLanguage("eng");// eng.traineddata is in /tessdata direcotry

        Rectangle rectangle = new Rectangle(0, 0, 763, 534);// recognize special area letters
        try {
            String result= tessInst.doOCR(image, rectangle);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }

    }

}

FAQ

1. ERROR net.sourceforge.tess4j.Tesseract - Not a JPEG file: starts with 0x89 0x50

Solution: the file is not acctually JPEG file, select true JPEG file.

2. WARN Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.

Solution: option A, tessInst.setDatapath(System.getProperty("user.dir") + "/tessdata");

               option B, set TESSDATA_PREFIX your environment. Which is Tesseract's tessdata default value. If do not set, it will

               open ./*.traineddata file.

3. "Warning: Parameter not found: enable_new_segsearch" 

Solution: Works with this eng.traineddata: https://github.com/tesseract-ocr/tessdata_fast/blob/master/eng.traineddata

 

Note: language data file best use tessdata_best's file. If you want to recognize chinese, select chi_sim.traineddata, and download it, move it in your tessdata directory.

Java's print API basically works on the assumption that everything is done at 72 dpi. This means that you can use this as bases for converting to/from different measurements

references:

1. http://www.jbrandsma.com/news/2015/12/07/ocr-with-java-and-tesseract/

2. https://sourceforge.net/projects/tess4j/

3. https://github.com/tesseract-ocr/tessdata_best

4. https://www.b4x.com/android/forum/threads/solved-tesseract-api-a-120-opotunity.101482/

6. https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/

7. https://stackoverflow.com/questions/18975595/how-to-design-an-image-in-java-to-be-printed-on-a-300-dpi-printer

@PostMapping(value = "/compare") public R compare(@RequestPart("File") MultipartFile file, @RequestPart("jsonContent") String jsonContent) { //读取PDF文本 String pdfText = compareService.extractContent(file); //解析JSON配置 JsonNode jsonConfig = null; try { jsonConfig = compareService.parseJson(jsonContent); } catch (Exception e) { return R.fail(MessageUtils.message("failed.convert.json")); } //执行对比校验 List<ValidationResult> results = compareService.compareContent(pdfText, jsonConfig); //返回没有匹配成功的数据 List<ValidationResult> failedResults = new ArrayList<>(); for (ValidationResult result : results) { if (!result.isValid()) { failedResults.add(result); } } if (failedResults.isEmpty()) { return R.ok("条件符合规范"); } else { return R.ok(failedResults); } }package com.example.demo.config; import com.fasterxml.jackson.databind.DeserializationFeature; import com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.databind.PropertyNamingStrategies; import com.fasterxml.jackson.dataformat.xml.XmlMapper; import org.springframework.context.annotation.Bean; import org.springframework.context.annotation.Configuration; import org.springframework.http.MediaType; import org.springframework.http.converter.HttpMessageConverter; import org.springframework.http.converter.json.MappingJackson2HttpMessageConverter; import org.springframework.http.converter.xml.MappingJackson2XmlHttpMessageConverter; import org.springframework.web.accept.ContentNegotiationManager; import org.springframework.web.accept.ContentNegotiationManagerFactoryBean; import org.springframework.web.servlet.config.annotation.ContentNegotiationConfigurer; import org.springframework.web.servlet.config.annotation.EnableWebMvc; import org.springframework.web.servlet.config.annotation.WebMvcConfigurer; import java.util.Arrays; import java.util.HashMap; import java.util.List; import java.util.Map; @Configuration @EnableWebMvc public class WebConfig implements WebMvcConfigurer { @Override public void configureMessageConverters(List<HttpMessageConverter<?>> converters) { // ObjectMapper objectMapper = JsonMapper.builder() MappingJackson2HttpMessageConverter jsonConverter = new MappingJackson2HttpMessageConverter(); ObjectMapper objectMapper = new ObjectMapper(); objectMapper.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES); jsonConverter.setObjectMapper(new ObjectMapper().setPropertyNamingStrategy(PropertyNamingStrategies.SNAKE_CASE)); jsonConverter.setSupportedMediaTypes(Arrays.asList(MediaType.APPLICATION_JSON, new MediaType("application", "*+json"))); MappingJackson2XmlHttpMessageConverter xmlConverter = new MappingJackson2XmlHttpMessageConverter(); xmlConverter.setObjectMapper(new XmlMapper().setPropertyNamingStrategy(PropertyNamingStrategies.SNAKE_CASE)); xmlConverter.setSupportedMediaTypes(Arrays.asList(MediaType.APPLICATION_XML, new MediaType("application", "*+xml"))); converters.add(jsonConverter); converters.add(xmlConverter); } @Override public void configureContentNegotiation(ContentNegotiationConfigurer configurer) { configurer .mediaType("json", MediaType.APPLICATION_JSON) .mediaType("xml", MediaType.APPLICATION_XML); } @Bean public ContentNegotiationManager contentNegotiationManager() { ContentNegotiationManagerFactoryBean factory = new ContentNegotiationManagerFactoryBean(); factory.setDefaultContentType(MediaType.APPLICATION_JSON); factory.setFavorParameter(true); factory.setParameterName("format"); factory.setIgnoreAcceptHeader(true); factory.setUseRegisteredExtensionsOnly(true); Map<String, MediaType> mediaTypes = new HashMap<>(); mediaTypes.put("json", MediaType.APPLICATION_JSON); mediaTypes.put("xml", MediaType.APPLICATION_XML); factory.addMediaTypes(mediaTypes); return factory.getObject(); } } package com.luxsan.domain; import com.fasterxml.jackson.annotation.JsonProperty; import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlRootElement; import lombok.AllArgsConstructor; import lombok.Data; import lombok.NoArgsConstructor; /** * 校验结果封装类 */ @Data @AllArgsConstructor @NoArgsConstructor @JacksonXmlRootElement(localName = "validationResult") public class ValidationResult { @JsonProperty("checkType") private String checkType; // 检查类型: FIELD/REGEX/NUMERIC @JsonProperty("fieldName") private String fieldName; // 字段名 @JsonProperty("expected") private String expected; // 预期值 @JsonProperty("actual") private String actual; // 实际值 @JsonProperty("valid") private boolean valid; // 是否通过 } package com.luxsan.service; import com.fasterxml.jackson.databind.JsonNode; import com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.dataformat.xml.XmlMapper; import com.hankcs.hanlp.HanLP; import com.hankcs.hanlp.seg.common.Term; import com.luxsan.common.core.utils.MessageUtils; import com.luxsan.domain.ValidationResult; import jakarta.annotation.PostConstruct; import lombok.RequiredArgsConstructor; import net.sourceforge.tess4j.Tesseract; import org.apache.pdfbox.pdmodel.PDDocument; import org.apache.pdfbox.text.PDFTextStripper; import org.springframework.stereotype.Service; import org.springframework.web.multipart.MultipartFile; import java.io.File; import java.nio.file.Files; import java.nio.file.Path; import java.nio.file.StandardCopyOption; import java.util.*; @RequiredArgsConstructor @Service public class ReadFileContentService { private final ObjectMapper objectMapper = new ObjectMapper(); private final XmlMapper xmlMapper = new XmlMapper(); private Tesseract tesseract; @PostConstruct public void initOcrEngine() { tesseract = new Tesseract(); try { //语言包路径和支持语言 tesseract.setDatapath("D:\\maven_use\\lingxi-lhc\\lingxi-ai-extend\\lingxi-ai-comparison\\src\\main\\resources\\tessdata"); tesseract.setLanguage("eng+chi_sim"); tesseract.setPageSegMode(6); // 自动页面分割 tesseract.setOcrEngineMode(1); // LSTM引擎 } catch (Exception e) { throw new RuntimeException("OCR引擎初始化失败: " + e.getMessage(), e); } } /** * 支持PDF和图片 */ public String extractContent(MultipartFile file) { String contentType = file.getContentType(); String fileName = file.getOriginalFilename().toLowerCase(); if (contentType == null) { return "不支持的文件类型: " + contentType; } if (fileName.endsWith(".pdf")) { return readPdfText(file); } return extractImageText(file); } /** * 读取PDF文本内容 * * @param file * @return */ public String readPdfText(MultipartFile file) { try (PDDocument doc = PDDocument.load(file.getInputStream())) { PDFTextStripper stripper = new PDFTextStripper(); // 设置行分隔符 stripper.setLineSeparator("\n"); // 设置字符间距 stripper.setSortByPosition(true); String rawText = stripper.getText(doc); System.out.println("内容" + rawText); return rawText.trim(); } catch (Exception e) { return MessageUtils.message("file.red.pdf.error"); } } /** * OCR识别图片内容 */ private String extractImageText(MultipartFile file) { try { // 创建临时文件 Path tempFile = Files.createTempFile("ocr_", getFileExtension(file.getOriginalFilename())); Files.copy(file.getInputStream(), tempFile, StandardCopyOption.REPLACE_EXISTING); // 执行OCR识别 File imageFile = tempFile.toFile(); String result = tesseract.doOCR(imageFile) .replaceAll("\\s+", " ").trim(); System.out.println("读取的内容" + result); // 清理临时文件 Files.deleteIfExists(tempFile); return result; } catch (Exception e) { return "OCR处理失败: " + e.getMessage(); } } private String getFileExtension(String filename) { if (filename == null) return ".tmp"; int dotIndex = filename.lastIndexOf('.'); return (dotIndex == -1) ? ".tmp" : filename.substring(dotIndex); } public JsonNode parseJson(String jsonContent) throws Exception { // return this.objectMapper.readTree(jsonContent); try { // 尝试解析 JSON return objectMapper.readTree(jsonContent); } catch (Exception e) { try { // 尝试解析 XML return xmlMapper.readTree(jsonContent); } catch (Exception ex) { throw new Exception("Failed to parse JSON or XML", ex); } } } public List<ValidationResult> compareContent(String pdfText, JsonNode jsonConfig) { List<ValidationResult> results = new ArrayList<>(); // 去除读取内容中的多余空格 pdfText = pdfText.replaceAll("\\s+", ""); // 处理JSON结构(支持单个对象或数组) JsonNode dataNode; if (jsonConfig.isArray() && jsonConfig.size() > 0) { dataNode = jsonConfig.get(0); } else if (jsonConfig.isObject()) { dataNode = jsonConfig; } else { results.add(new ValidationResult("ERROR", "JSON格式错误", "期望一个对象或包含对象的数组", "实际格式不匹配", false)); return results; } // 动态定义地址字段列表 Set<String> addressFields = new HashSet<>(); Iterator<String> fieldNames = dataNode.fieldNames(); while (fieldNames.hasNext()) { String fieldName = fieldNames.next(); if (fieldName.contains("CITY") || fieldName.contains("STATE") || fieldName.contains("ZIP")) { addressFields.add(fieldName); } } // 字段直接匹配 checkNonAddressFields(pdfText, dataNode, results, addressFields); // 连续匹配 checkAddressFields(pdfText, dataNode, results, addressFields); return results; } /** * 检查 JSON 中非地址字段是否严格存在于 PDF 文本中 */ private void checkNonAddressFields(String pdfText, JsonNode jsonConfig, List<ValidationResult> results, Set<String> addressFields) { Iterator<Map.Entry<String, JsonNode>> fields = jsonConfig.fields(); while (fields.hasNext()) { Map.Entry<String, JsonNode> entry = fields.next(); String fieldName = entry.getKey(); JsonNode valueNode = entry.getValue(); if (valueNode.isValueNode() && !addressFields.contains(fieldName)) { //去除多余空格 String expectedValue = valueNode.asText().trim().replaceAll("\\s+", ""); if (expectedValue.isEmpty()) continue; // 直接进行字符串匹配 boolean found = pdfText.contains(expectedValue); results.add(new ValidationResult( "FIELD", fieldName, expectedValue, found ? "Found" : "Not Found", found )); } } } /** * 检查 JSON 中地址字段是否严格存在于 PDF 文本中 */ private void checkAddressFields(String pdfText, JsonNode jsonConfig, List<ValidationResult> results, Set<String> addressFields) { // HanLP分词 List<Term> terms = HanLP.segment(pdfText); List<String> addressParts = new ArrayList<>(); for (Term term : terms) { String word = term.word; if (word.matches("\\d{5,7}")) { addressParts.add(word); } else if (term.nature.toString().startsWith("ns")) { addressParts.add(word); } } // 遍历 JSON 配置中的地址字段 Iterator<Map.Entry<String, JsonNode>> fields = jsonConfig.fields(); while (fields.hasNext()) { Map.Entry<String, JsonNode> entry = fields.next(); String fieldName = entry.getKey(); JsonNode valueNode = entry.getValue(); if (valueNode.isValueNode() && addressFields.contains(fieldName)) { //去除多余空格 String expectedValue = valueNode.asText().trim().replaceAll("\\s+", ""); if (expectedValue.isEmpty()) continue; boolean found = false; for (String part : addressParts) { if (part.equals(expectedValue)) { found = true; break; } } results.add(new ValidationResult( "FIELD", fieldName, expectedValue, found ? "Found" : "Not Found", found )); } } } } 我的方法 我的结果我选择json的时候结果是对的 然后我选择xml的时候 结果还是json
最新发布
07-11
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值