Trained Tesseract on 瘦金体 (Slender Gold script) successfully!!

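The author successfully trained Tesseract to recognize 瘦金体 and documented the process in detail: the resources from the Early Modern OCR Project and the Tesseract training steps, including image preparation, editing the box file, and generating the training files. Problems came up during training, but they were eventually resolved by training with the chi_sim language, which produced chi.slimqjs.traineddata.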

Successfully trained Tesseract to recognize 瘦金体 (well, 24 characters of it :))

--Interlude--
Very good training material from the Early Modern OCR Project (eMOP). It appears to be the work of PRIMALabs.
TesseractTraining
Testing with Tesseract

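There are many resources online. Here are a few whose steps are well defined and clearly explained:
"Using jTessBoxEditor to train Tesseract 3.02.02 on samples and improve CAPTCHA recognition"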

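  • Mentions merging the sample images into a multi-page TIFF file.
  • Note: DO NOT MIX FONTS IN AN IMAGE FILE (In a single .tr file to be precise.)
  • Includes the console output of the intermediate steps, which is very useful for comparing against the output of your own training run
  • Uses shapeclustering (some tutorials skip it, apparently without problems?)
  • Ends with a list of steps: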
    1. Merge the images
    2. Generate the box file
    tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
    3. Correct the box file
    4. Generate font_properties
    echo fontyp 0 0 0 0 0 >font_properties
    5. Generate the training file
    tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
    6. Generate the character set file
    unicharset_extractor langyp.fontyp.exp0.box
    7. Generate the shape file
    shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
    8. Generate the clustered character feature files
    mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
    9. Generate the character normalization feature file
    cntraining langyp.fontyp.exp0.tr
    10. Rename the output files
    rename normproto fontyp.normproto
    rename inttemp fontyp.inttemp
    rename pffmtable fontyp.pffmtable
    rename unicharset fontyp.unicharset
    rename shapetable fontyp.shapetable
    11. Combine the training files to produce fontyp.traineddata
    combine_tessdata fontyp.
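
The command steps in the list above differ only in the language and font names, so they can be generated from those two values. The following is a minimal sketch, not part of the referenced tutorial: the function name `training_commands` and its parameters are hypothetical, and it only builds the command strings exactly as listed (it does not run them). Steps 1 and 3 (merging the images and correcting the box file) are manual and are therefore left out.

```python
from typing import List


def training_commands(lang: str, font: str, exp: int = 0,
                      bootstrap_lang: str = "eng", psm: int = 7) -> List[str]:
    """Build the command lines of steps 2 and 4-11 for one font, in order.

    `bootstrap_lang` is the existing language passed to -l; the tutorial
    uses eng. All names here are illustrative assumptions.
    """
    base = f"{lang}.{font}.exp{exp}"  # e.g. langyp.fontyp.exp0
    tif, box, tr = f"{base}.tif", f"{base}.box", f"{base}.tr"

    cmds = [
        # Step 2: generate the box file (step 3, fixing it, is done by hand).
        f"tesseract {tif} {base} -l {bootstrap_lang} -psm {psm} batch.nochop makebox",
        # Step 4: one font_properties line per font.
        f"echo {font} 0 0 0 0 0 >font_properties",
        # Step 5: generate the .tr training file from the corrected boxes.
        f"tesseract {tif} {base} -l {bootstrap_lang} -psm {psm} nobatch box.train",
        # Step 6: extract the character set from the box file.
        f"unicharset_extractor {box}",
        # Steps 7-9: shape clustering and feature training.
        f"shapeclustering -F font_properties -U unicharset -O {lang}.unicharset {tr}",
        f"mftraining -F font_properties -U unicharset -O {lang}.unicharset {tr}",
        f"cntraining {tr}",
    ]
    # Step 10: prefix every output file with the font name (Windows `rename`).
    for name in ("normproto", "inttemp", "pffmtable", "unicharset", "shapetable"):
        cmds.append(f"rename {name} {font}.{name}")
    # Step 11: bundle everything into <font>.traineddata.
    cmds.append(f"combine_tessdata {font}.")
    return cmds


if __name__ == "__main__":
    for cmd in training_commands("langyp", "fontyp"):
        print(cmd)
```

Calling `training_commands("chi", "slimqjs", bootstrap_lang="chi_sim")` would produce the same sequence for the font trained in this post; whether the `-psm 7` setting suits a given sample image is something to check per image.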

My training results

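  • Step 5 above is the one that really matters. The official site says it does not matter which language is used, but in my tests the default English seemed to produce more errors?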
    tesseract.exe chi.slimqjs.exp0.tif chi.slimqjs.exp0 nobatch box.train
    Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
    Page 1
    row xheight=86, but median xheight = 0.5
    row xheight=77, but median xheight = 0.5
    row xheight=68, but median xheight = 0.5
    row xheight=77, but median xheight = 0.5
    FAIL!
    APPLY_BOXES: boxfile line 1/里 ((73,490),(185,582)): FAILURE! Couldn't find a matching blob
    FAIL!
    APPLY_BOXES: boxfile line 2/凤 ((226,480),(355,600)): FAILURE! Couldn't find a matching blob
    FAIL!
    APPLY_BOXES: boxfile line 5/露 ((677,478),(803,603)): FAILURE! Couldn't find a matching blob
    FAIL!
    APPLY_BOXES: boxfile line 6/魁 ((828,480),(951,592)): FAILURE! Couldn't find a matching blob
    APPL
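
The log is cut off here, but the `APPLY_BOXES ... FAILURE!` lines follow a regular pattern, so the box entries that Tesseract could not match can be collected automatically. The sketch below is an illustration only: `parse_failures` and `BoxFailure` are hypothetical names, the regular expression is derived solely from the four lines shown above, and the four numbers are assumed to be the two corner points of the box as printed in the log.

```python
import re
from typing import List, NamedTuple


class BoxFailure(NamedTuple):
    line_no: int  # line number in the .box file, as printed in the log
    char: str     # character the box was labelled with
    x1: int       # first corner, as printed: ((x1,y1),(x2,y2))
    y1: int
    x2: int
    y2: int


# Pattern inferred from the log lines in this post (an assumption, not a spec).
_FAILURE = re.compile(
    r"APPLY_BOXES: boxfile line (\d+)/(\S+) "
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE!"
)

SAMPLE_LOG = """\
FAIL!
APPLY_BOXES: boxfile line 1/里 ((73,490),(185,582)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2/凤 ((226,480),(355,600)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 5/露 ((677,478),(803,603)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 6/魁 ((828,480),(951,592)): FAILURE! Couldn't find a matching blob
"""


def parse_failures(log: str) -> List[BoxFailure]:
    """Return one BoxFailure per APPLY_BOXES failure line found in `log`."""
    failures = []
    for match in _FAILURE.finditer(log):
        line_no, char, x1, y1, x2, y2 = match.groups()
        failures.append(
            BoxFailure(int(line_no), char, int(x1), int(y1), int(x2), int(y2))
        )
    return failures


if __name__ == "__main__":
    for failure in parse_failures(SAMPLE_LOG):
        print(failure.line_no, failure.char)
```

Saving the console output of step 5 to a file and passing its contents to `parse_failures` gives a short list of box-file lines to re-check in jTessBoxEditor before rerunning the training step.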