Tesseract miscellany

Latest recommended article published on 2023-03-06 16:32:20

yasi_xi

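This article explains how to resolve the Chinese character encoding problem that Tesseract OCR runs into in Visual Studio 2008, and how to configure paths. It covers installing updates, encoding settings, and error-resolution steps so that OCR recognition runs smoothly.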

For leptonica path configuration, see: http://www.myexception.cn/vsts/1118613.html

Compiling and using tesseract_ocr in VS2010 to recognize CAPTCHAs, plus the following problems:

In VS2008, a source file that contains Chinese characters fails to compile: ccmain/equationdetect.cpp

if (unicharset.get_ispunctuation(id)) {
// Exclude some special texts that are likely to be confused as math symbol.
static GenericVector<UNICHAR_ID> ids_to_exclude;
if (ids_to_exclude.empty()) {
static const STRING kCharsToEx[] = {"'", "`", "\"", "\\", ",", ".",
"〈", "〉", "《", "》", "」", "「", ""};
1>..\..\ccmain\equationdetect.cpp(251) : warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss
1>..\..\ccmain\equationdetect.cpp(251) : error C2146: syntax error : missing '}' before identifier '銆'

1>..\..\ccmain\equationdetect.cpp(251) : error C2146: syntax error : missing ';' before identifier '銆'
1>..\..\ccmain\equationdetect.cpp(251) : error C2065: '銆' : undeclared identifier
1>..\..\ccmain\equationdetect.cpp(251) : error C2146: syntax error : missing ';' before identifier '銆'
1>..\..\ccmain\equationdetect.cpp(251) : error C2065: '銆' : undeclared identifier
1>..\..\ccmain\equationdetect.cpp(251) : error C2146: syntax error : missing ';' before identifier '銆'
1>..\..\ccmain\equationdetect.cpp(251) : error C2065: '銆' : undeclared identifier
1>..\..\ccmain\equationdetect.cpp(251) : error C2143: syntax error : missing ';' before '}'
1>..\..\ccmain\equationdetect.cpp(253) : error C2065: 'kCharsToEx' : undeclared identifier
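The identifier '銆' in these errors does not appear anywhere in the source. A plausible explanation, sketched below in Python, assumes that equationdetect.cpp is stored as UTF-8 without a BOM and that the compiler reads it as code page 936. Each of the six CJK brackets then encodes to three bytes beginning with E3 80, and those two bytes decode to the single character 銆 under code page 936. The leftover third byte appears to swallow the closing quote. The helper name `first_two_bytes_as_cp936` is made up for this illustration.

```python
# Assumption: equationdetect.cpp is stored as UTF-8 without a BOM,
# and the compiler reads it using code page 936 (GBK).
BRACKETS = ["〈", "〉", "《", "》", "」", "「"]


def first_two_bytes_as_cp936(ch):
    """Decode the first two UTF-8 bytes of `ch` as a code page 936 reader would."""
    return ch.encode("utf-8")[:2].decode("cp936")


if __name__ == "__main__":
    for ch in BRACKETS:
        raw = ch.encode("utf-8")
        seen = first_two_bytes_as_cp936(ch)
        # Print code points only, so the output is safe on any console encoding.
        print(f"U+{ord(ch):04X} utf-8={raw.hex()} "
              f"first-two-bytes-as-cp936=U+{ord(seen):04X}")
```

Every bracket yields the same misread character, which is consistent with the repeated "undeclared identifier" errors on line 251.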

~~Solution: http://www.myexception.cn/vsts/1118613.html, download and install VS2008 SP1: Microsoft Visual Studio 2008 Service Pack 1 (installer)~~

~~Solution: how to resolve VC++ warning C4819~~

Solution: select the entire contents of the source file that has the encoding error, then in VS2005 go to File => Advanced Save Options => Encoding and choose “Chinese Simplified (GB2312) - Codepage 936”.
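When several files are affected, the same re-save could be scripted instead of done by hand. The Python sketch below is not part of the Tesseract build. It assumes the affected files are currently UTF-8 (with or without a BOM), and the function name `resave_as_cp936` is hypothetical.

```python
from pathlib import Path


def resave_as_cp936(path):
    """Re-save a UTF-8 source file in code page 936 (GBK).

    Returns False and leaves the file untouched if it is pure ASCII,
    True if it was rewritten. Raises UnicodeError if the file is not valid
    UTF-8 or holds a character that code page 936 cannot represent.
    """
    path = Path(path)
    raw = path.read_bytes()
    if raw.isascii():
        return False                      # nothing to convert
    text = raw.decode("utf-8-sig")        # also drops a UTF-8 BOM if present
    path.write_bytes(text.encode("cp936"))
    return True
```

A call such as `resave_as_cp936(r"ccmain\equationdetect.cpp")` rewrites the file in place, so keeping a backup copy first is sensible.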

When building the classifier_tester, cntraining, combine_tessdata, dawg2wordlist, mftraining, and shapeclustering projects, the following error occurs:

error PRJ0002 : Error result 31 returned from 'C:\Program Files\Microsoft SDKs\Windows\v6.0A\bin\mt.exe'

Solution: Properties -> Configuration Properties -> Linker -> Manifest File, set Generate Manifest to No

Building the training applications (the projects below can only be built in the LIB_Debug or LIB_Release configuration; building them as a DLL fails, as the official documentation states)

The training related applications are built using the following projects:

 
ambiguous_words
classifier_tester
cntraining
combine_tessdata
dawg2wordlist
mftraining
shapeclustering
unicharset_extractor
wordlist2dawg
 
 

Note

Currently these applications can ONLY be built with the LIB_Debug and LIB_Release configurations. If you try to use a DLL configuration you’ll get “undefined external symbol” errors.

The official Tesseract documentation states that Tesseract 3.0 can only handle UTF-8 encoded characters:

Tesseract was originally designed to recognize English text only. Efforts have been made to modify the engine and its training system to make them able to deal with other languages and UTF-8 characters. Tesseract 3.0 can handle any Unicode characters (coded with UTF-8), but there are limits as to the range of languages that it will be successful with, so please take this section into account before building up your hopes that it will work well on your particular language!

Tesseract is slower with large character set languages (like Chinese), but it seems to work OK.

Any language that has different punctuation and numbers is going to be disadvantaged by some of the hard-coded algorithms that assume ASCII punctuation and digits. To be fixed in 3.0x for x>=2.

On path configuration:

It’s usually better to make a separate directory to test tesseract.exe. To run tesseract, you either need to make sure your test directory contains the tessdata folder (the tesseract language data) or you set the TESSDATA_PREFIX environment variable to point to it. See http://code.google.com/p/tesseract-ocr/wiki/ReadMe for important details.

For example, you can use the following directory structure:

 
C:\BuildFolder\
    include\
    lib\
    tesseract-3.02\
    testing\
        tessdata\
 
 

Copy your tesseract executable to C:\BuildFolder\testing. If you built a DLL version then be sure to also copy the required DLLs to the same directory (or add C:\BuildFolder\lib to your PATH, although this isn’t really recommended).

For example, if you are trying to run tesseractd.exe then you’ll need to also copy the following to C:\BuildFolder\testing:

 
liblept168d.dll
libtesseract302d.dll
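Before running the executable, it can help to confirm that the test directory really holds everything listed above. The following Python sketch is only a pre-flight check. The function name `missing_runtime_files` is hypothetical, and the file names in the usage part come from the debug DLL example, so other builds will need different names.

```python
from pathlib import Path


def missing_runtime_files(test_dir, required):
    """Return the entries of `required` that do not exist under `test_dir`."""
    root = Path(test_dir)
    return [name for name in required if not (root / name).exists()]


if __name__ == "__main__":
    # Names taken from the debug DLL example; adjust them for other builds.
    needed = ["tesseractd.exe", "liblept168d.dll",
              "libtesseract302d.dll", "tessdata"]
    print(missing_runtime_files(r"C:\BuildFolder\testing", needed))
```

An empty list means every expected file and folder is in place.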

sudo apt-get update

sudo apt-get install autoconf automake libtool
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install zlib1g-dev

dpkg -l | grep xxx

$ sudo vi /etc/profile
Append the PATH settings at the end of the file:
export PATH="$PATH:your path1:your path2 ..."
export TESSDATA_PREFIX="/yasi/testing"
source /etc/profile   # make the environment variable changes take effect
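To confirm that the variable points somewhere useful, a small check like the Python sketch below can be run after the profile is reloaded. It mirrors the two options described earlier (a tessdata folder in the working directory, or TESSDATA_PREFIX) and is not Tesseract's own lookup code. It assumes that TESSDATA_PREFIX names the directory containing the tessdata folder, as the /yasi/testing value suggests, and that language files are named `<lang>.traineddata`. The function name `find_traineddata` is hypothetical.

```python
import os
from pathlib import Path


def find_traineddata(lang, env=None, cwd="."):
    """Return the path of <lang>.traineddata, or None if it is not found.

    Assumption: TESSDATA_PREFIX is the parent directory of tessdata.
    Without the variable, tessdata is looked for under `cwd`.
    """
    env = os.environ if env is None else env
    base = Path(env.get("TESSDATA_PREFIX") or cwd)
    candidate = base / "tessdata" / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    print(find_traineddata("eng"))
```

A result of `None` suggests that either the variable or the folder layout needs another look.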

When downloading Boost.Build from http://www.boost.org/boost-build2/, do not use the "Download: [zip], [tar.bz2]" links, because those packages may not contain the bootstrap.sh file; use the "Nightly build: [zip], [tar.bz2]" links instead.

With only Boost.Build downloaded:

./bootstrap.sh
./b2 install --prefix=PREFIX

(PREFIX is a directory where you want Boost.Build to be installed. Optionally, add PREFIX/bin to your PATH environment variable)

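When installing Boost.Build, it is best to set the prefix explicitly rather than rely on the default. Because you chose the paths yourself, it is easy to confirm that the installation succeeded. Also remember to set the corresponding environment variables.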

Before running bash.sh xxx to build unicomm, make sure every Linux script file under the unicomm directory has execute permission, especially build.bash and build/build.bash.
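Setting the permission file by file is easy to get wrong, so the step can be scripted. The Python sketch below adds the execute bits, similar to `chmod a+x`, to every script under a directory tree. It assumes scripts can be recognised by a `.sh` or `.bash` suffix, which may not cover every script in unicomm. The function name `make_scripts_executable` is hypothetical.

```python
import os
import stat
from pathlib import Path


def make_scripts_executable(root, suffixes=(".sh", ".bash")):
    """Add execute permission to every script under `root`.

    Returns the sorted list of files whose mode was changed.
    Assumption: scripts are identified by their file suffix.
    """
    changed = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            mode = path.stat().st_mode
            new_mode = mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
            if new_mode != mode:
                os.chmod(path, new_mode)
                changed.append(path)
    return sorted(changed)
```

Running it a second time returns an empty list, because nothing is left to change.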

./build.sh toolset=gcc variant=debug link=shared --prefix="/yasi/libunicomm-1.00c-installed" --use-pre-built-boost=1.53.0

==============================

Building libsmart:

1. First download xerces-c_2_8_0-x86-linux-gcc_3_4.tar.gz from http://xerces.apache.org/xerces-c/download.cgi. The prebuilt binary package is fine. Extract it under /yasi.

2. cd /yasi/libsmart-1.01d/build/dependencies and edit the file xercesc_root-linux, changing the path to /yasi/xerces-c_2_8_0-x86-linux-gcc_3_4. The path in the boost_root-linux file in the same directory must also be changed to /yasi/boost_1_53_0.

3. cd /yasi/libsmart-1.01d and run build_smart.sh; the build should then succeed.