Scanning English Documents in Ubuntu

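Tesseract is an OCR engine developed at HP Labs and later open-sourced, known for its high accuracy. It recognizes text in seven languages, including English and German, and more than one language dictionary can be installed. The tool is currently available only as a command-line program, accepts only TIFF files as input, and does not support layout analysis. This article also describes how to prepare TIFF images with GIMP and gives a script for processing multi-page PDF documents.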

Tesseract

Introduction

Tesseract is arguably the OCR engine that produces the most accurate results. It was initially developed by HP Labs between 1985 and 1995 and was open-sourced in 2005.

Tesseract can recognize text in 7 different languages: English, German, French, Italian, Spanish, Brazilian Portuguese and Dutch. You can install more than one dictionary if needed.

It does not support layout analysis, so multi-column text, images, equations and similar content will likely produce garbled text output. It also accepts only TIFF images as input.

Usage

Tesseract is currently a command-line-only tool (although its developers are working on an integration with OCRopus for a GUI). After successful installation, the command to use is tesseract <path to tiff image> <output file>. Tesseract will automatically give the output file a .txt extension.

It is critical that the TIFF image has a ".tif" extension and not a ".tiff" extension. The command line should look like this example:

$ tesseract ~/input.tif output

Here input.tif is the document to be converted, located in your home folder, and output is the base name of the file Tesseract will create: the result is written to output.txt in the current directory.
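Because of the ".tif" requirement, it can help to normalize file names before calling Tesseract. The following is a minimal sketch, not part of the original instructions: the function names tif_name, output_base and ocr_one are hypothetical helpers, and it assumes the tesseract command is on your PATH.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Tesseract itself).

# tif_name: print the path with a ".tif" extension; ".tiff" is rewritten.
tif_name() {
    case "$1" in
        *.tiff) printf '%s\n' "${1%.tiff}.tif" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}

# output_base: print the file name without directory and without the
# .tif/.tiff extension, suitable as Tesseract's second argument.
output_base() {
    name=$(basename "$1")
    name=${name%.tiff}
    name=${name%.tif}
    printf '%s\n' "$name"
}

# ocr_one: rename a .tiff file to .tif if needed, then run tesseract on it.
# Assumes `tesseract` is installed; output goes to <base>.txt in the
# current directory.
ocr_one() {
    src=$1
    fixed=$(tif_name "$src")
    [ "$src" = "$fixed" ] || mv -- "$src" "$fixed"
    tesseract "$fixed" "$(output_base "$fixed")"
}
```

With these definitions loaded, ocr_one ~/scan.tiff would rename the file to ~/scan.tif and ask Tesseract to write scan.txt.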

Preparing images for Tesseract

Tesseract is not very flexible about the format of its input images. It will only accept TIFF images. According to user reports, compressed TIFF images are quite problematic, and the same goes for grey-scale and colour images. So you're better off with single-bit uncompressed TIFF images.

The process to prepare them with GIMP is very simple:

  1. Go to the Image→Mode menu and make sure the image is in RGB or Grayscale mode.
  2. Select from the menu Tools→Color Tools→Threshold and choose an adequate threshold value.
  3. Select from the menu Image→Mode→Indexed and from the options choose 1-bit and no dithering.
  4. Save the image in TIFF format with a .tif extension.
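The GIMP steps above can probably be approximated on the command line with ImageMagick. This is only a sketch under assumptions: it assumes ImageMagick's convert command is installed, the 50% threshold is an arbitrary starting value that you will likely need to tune per scan, and bilevel_name and to_bilevel_tif are hypothetical helper names.

```shell
#!/bin/sh
# Assumed default threshold; adjust per document (this value is a guess).
THRESHOLD=${THRESHOLD:-50%}

# bilevel_name: print the output path, i.e. the input path with its
# extension replaced by "-1bit.tif". Assumes the file name has an extension.
bilevel_name() {
    printf '%s\n' "${1%.*}-1bit.tif"
}

# to_bilevel_tif: convert an image to a 1-bit, uncompressed TIFF.
# Mirrors the GIMP steps: grayscale, threshold, 1-bit, save as .tif.
to_bilevel_tif() {
    convert "$1" -colorspace Gray -threshold "$THRESHOLD" \
        -type Bilevel -compress None "$(bilevel_name "$1")"
}
```

For example, to_bilevel_tif scan.png would write scan-1bit.tif next to the input. Checking the result visually before running OCR is still advisable, since a poor threshold loses thin strokes.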

Using Tesseract With a Multi Page PDF

Often, scanned documents are stored as raster images in a large PDF document. Using ImageMagick, the individual pages can be extracted as TIFF files for processing with Tesseract. The following script can help automate this process:

#!/bin/sh
PAGES=100 # set to the number of pages in the PDF
SOURCE=book.pdf # set to the file name of the PDF
OUTPUT=book.txt # set to the final output file
RESOLUTION=600 # set to the resolution the scanner used (the higher, the better)

touch "$OUTPUT"
for i in $(seq 1 "$PAGES"); do
    # ImageMagick numbers PDF pages from 0, so page i is index i-1
    convert -monochrome -density "$RESOLUTION" "$SOURCE[$((i - 1))]" "page$i.tif"
    # tesseract writes its result to page$i.txt
    tesseract "page$i.tif" "page$i"
    # append this page's text to the final output, then clean up
    cat "page$i.txt" >> "$OUTPUT"
    rm "page$i.tif" "page$i.txt"
done

After running this script, the OCR text should be contained in book.txt (or whatever you set $OUTPUT to be).
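The script requires PAGES to be set by hand. One possible way to avoid that is to read the page count from the PDF itself. The sketch below assumes the pdfinfo tool (from poppler-utils) is installed, which the original instructions do not mention; parse_pages and pdf_pages are hypothetical helper names.

```shell
#!/bin/sh
# parse_pages: read `pdfinfo` output on stdin and print the page count,
# taken from the line that starts with "Pages:".
parse_pages() {
    awk '/^Pages:/ { print $2 }'
}

# pdf_pages: print the number of pages in the PDF given as $1.
# Assumes `pdfinfo` (poppler-utils) is available.
pdf_pages() {
    pdfinfo "$1" | parse_pages
}
```

In the script, the fixed value could then be replaced with PAGES=$(pdf_pages "$SOURCE"), placed after SOURCE is defined.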

Source:

https://help.ubuntu.com/community/OCR
