Using tesseract-ocr

Latest recommended article published on 2025-02-21 16:29:24

lrzss


Tags: artificial intelligence

Article link: https://blog.youkuaiyun.com/lrzss/article/details/130174133


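This post assumes tesseract-ocr is already installed and the Chinese language data has been downloaded into the tessdata folder. Running tesseract --help-extra in cmd prints the following: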

Command: tesseract --help-extra
Usage:
    tesseract --help | --help-extra | --help-psm | --help-oem | --version
    tesseract --list-langs [--tessdata-dir PATH]
    tesseract --print-parameters [options...] [configfile...]
    tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

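OCR options:
    --tessdata-dir PATH     Specify the location of the tessdata path.
    --user-words PATH       Specify the location of the user words file.
    --user-patterns PATH    Specify the location of the user patterns file.
    -l LANG[+LANG]          Specify the language(s) used for OCR; several language packs can be combined.

    --psm NUM  Specify page segmentation mode.
    --oem NUM  Specify OCR Engine mode.
NOTE: These options must occur before any configfile.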

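Page segmentation modes:
     0 Orientation and script detection (OSD) only.
     1 Automatic page segmentation with OSD.
     2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
     3 Fully automatic page segmentation, but no OSD. (Default)
     4 Assume a single column of text of variable sizes.
     5 Assume a single uniform block of vertically aligned text.
     6 Assume a single uniform block of text.
     7 Treat the image as a single text line.
     8 Treat the image as a single word.
     9 Treat the image as a single word in a circle.
    10 Treat the image as a single character.
    11 Sparse text. Find as much text as possible in no particular order.
    12 Sparse text with OSD.
    13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.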

OCR Engine modes:
    0 Legacy engine only.
    1 Neural nets LSTM engine only.
    2 Legacy + LSTM engines.
    3 Default, based on what is available.

Single options:
    -h, --help      Show minimal help message.
    --help-extra    Show extra help for advanced users.
    -v, --version   Show version information.
    --list-langs    List available languages for tesseract engine.

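Using tesseract-ocr from cmd:

1. tesseract imgPath savePath

For example: tesseract 1.jpg result (the recognized text is written to result.txt)

2. tesseract imgPath savePath -l traineddata (specify the language pack)

For example: tesseract 1.jpg result -l eng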

eng: English

chi_sim: Simplified Chinese

eng+chi_sim: English + Simplified Chinese
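
The command forms above can also be assembled from Python. The following is only a minimal sketch: `build_tesseract_cmd` is a hypothetical helper (not part of Tesseract or pytesseract) that builds the argument list for the two usages shown, with the options placed after the output base name as in the usage text.

```python
import shlex


def build_tesseract_cmd(image, outputbase, langs=None, psm=None, oem=None):
    """Build the argument list for a tesseract CLI call (hypothetical helper)."""
    cmd = ["tesseract", image, outputbase]
    if langs:
        # Several language packs are joined with '+', e.g. eng+chi_sim
        cmd += ["-l", "+".join(langs)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    return cmd


cmd = build_tesseract_cmd("1.jpg", "result", langs=["eng", "chi_sim"], psm=6)
print(shlex.join(cmd))
# To actually run it (requires tesseract on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the image path contains spaces.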

Using it in PyCharm (through pytesseract):

import pytesseract

# Point pytesseract at the tesseract executable
pytesseract.pytesseract.tesseract_cmd = 'D:/software/tesseract-ocr/tesseract.exe'

1. List the available language packs

print(pytesseract.get_languages())
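
Before passing a combined code such as eng+chi_sim, it can help to check that every requested pack is actually installed. A small sketch, assuming the list of installed languages has already been obtained; `missing_languages` is a hypothetical helper, and the `available` list here is a made-up stand-in so the example runs without Tesseract:

```python
def missing_languages(lang, available):
    """Return the codes in an 'eng+chi_sim'-style string that are not installed."""
    requested = [code for code in lang.split("+") if code]
    return [code for code in requested if code not in available]


# In real use, `available` would come from pytesseract.get_languages();
# this list is invented for illustration only.
available = ["eng", "osd"]
print(missing_languages("eng+chi_sim", available))
```

An empty result means every requested language pack was found.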

2. pytesseract.image_to_boxes(), pytesseract.image_to_string(), pytesseract.image_to_data() and the related functions take largely the same parameters, as follows:

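image_to_string: parameters (image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0) -> (bytes | str)
    
    image Object or String - PIL Image/NumPy array or file path of the image to be processed by Tesseract. If you pass an object instead of a file path, pytesseract will implicitly convert the image to RGB mode.
    lang String - Tesseract language code string. Defaults to eng if not specified. Example for multiple languages: lang='eng+chi_sim'
    config String - any additional custom configuration flags that are not available through the pytesseract function. For example: config='--psm 6'
    nice Integer - modifies the processor (CPU scheduling) priority of the Tesseract run. Not supported on Windows. Nice adjusts the niceness of Unix-like processes.
    output_type Class attribute - specifies the type of the output; defaults to string. For the full list of supported types, see the definition of the pytesseract.Output class.
    timeout Integer or Float - duration in seconds for the OCR processing, after which pytesseract terminates and raises RuntimeError.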
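
With the default string output type, pytesseract.image_to_data() returns tab-separated text with a header row (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). The sketch below shows one way to keep only confidently recognized words. `words_from_tsv` is a hypothetical helper, the 60.0 threshold is an arbitrary assumption, and the `sample` text is hand-written for illustration rather than real OCR output.

```python
import csv
import io


def words_from_tsv(tsv_text, min_conf=60.0):
    """Parse image_to_data TSV text; keep words with conf >= min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        if row.get("conf") is None:
            continue  # skip malformed (short) rows
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])  # structural rows carry conf = -1
        if text and conf >= min_conf:
            box = tuple(int(row[k]) for k in ("left", "top", "width", "height"))
            words.append((text, conf, box))
    return words


# Hand-written sample in the image_to_data column layout (not real output).
sample = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t50\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t12\t80\t20\t95.5\tHello\n"
    "5\t1\t1\t1\t1\t2\t100\t12\t60\t20\t41.0\tw0rld\n"
)
print(words_from_tsv(sample))
```

In real use the input would be the return value of pytesseract.image_to_data(img, lang='chi_sim+eng').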

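'''psm --- page segmentation mode; 8 treats the image as a single word
   oem --- OCR engine mode; 3 is the default, chosen based on what is available (2 is legacy + LSTM)
   timeout --- raises RuntimeError when the time limit is exceeded
   outputbase --- base name of the output file (the appropriate extension is appended)
'''

try:
    # img is a PIL Image, NumPy array, or file path prepared beforehand
    text = pytesseract.image_to_string(img, lang='chi_sim+eng', config='--psm 8 --oem 3 outputbase digits', timeout=2)
    print(text)
except RuntimeError as timeout_error:
    print("Timed out")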
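
When many images are processed in a loop, the try/except above can be wrapped once so that a timeout yields a fallback value instead of stopping the run. This is only a sketch: `ocr_or_default` is a hypothetical wrapper, and `fake_slow_ocr` is a stand-in that simulates a timeout so the example runs without Tesseract.

```python
def ocr_or_default(ocr_func, image, default="", **kwargs):
    """Call an OCR function; return `default` if it raises RuntimeError."""
    try:
        return ocr_func(image, **kwargs)
    except RuntimeError:
        # pytesseract signals a timeout with RuntimeError
        return default


# Stand-in for pytesseract.image_to_string (assumption, for illustration).
def fake_slow_ocr(image, timeout=0, **kwargs):
    raise RuntimeError("Tesseract process timeout")


print(repr(ocr_or_default(fake_slow_ocr, "1.jpg", timeout=2)))
```

In real use the first argument would be pytesseract.image_to_string, with lang, config and timeout passed as keyword arguments.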

Drawing Chinese text on the image:

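import cv2
import numpy as np
import pytesseract
from PIL import Image, ImageDraw, ImageFont

def output_ChineseText_display(img, text, position, textColor=(0, 255, 0), textSize=30):
    if isinstance(img, np.ndarray):  # OpenCV image: convert the BGR array to an RGB PIL image
        img = Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    # Create an object that can draw on the given image
    draw = ImageDraw.Draw(img)
    # Font: simsun.ttc (SimSun) must be available on the system
    fontStyle = ImageFont.truetype("simsun.ttc", textSize, encoding="utf-8")
    # Draw the text
    draw.text(position, text, textColor, font=fontStyle)
    # Convert back to OpenCV (BGR) format
    return cv2.cvtColor(np.asarray(img), cv2.COLOR_RGB2BGR)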

pytesseract.pytesseract.tesseract_cmd = 'D:/software/tesseract-ocr/tesseract.exe'
img = cv2.imread('E:/work/images/store_s.jpg')
hImg, wImg, _ = img.shape
boxes = pytesseract.image_to_boxes(img, lang="chi_sim")

# Each line is "char left bottom right top page", with the origin at the bottom-left corner
for b in boxes.splitlines():
    b = b.split(' ')
    print(b)
    left, bottom, right, top = int(b[1]), int(b[2]), int(b[3]), int(b[4])
    # OpenCV's origin is the top-left corner, so the y coordinates are flipped
    cv2.rectangle(img, (left, hImg - bottom), (right, hImg - top), (50, 50, 255), 2)
    # cv2.putText(img, b[0], (left, hImg - bottom + 25), cv2.FONT_HERSHEY_SIMPLEX, 1, (50, 50, 255), 2)  # cannot render Chinese
    img = output_ChineseText_display(img, b[0], (left, hImg - bottom))  # renders Chinese
cv2.imshow('img', img)
cv2.waitKey()
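
The coordinate flip in the loop is easy to get wrong, because image_to_boxes reports each character as "char left bottom right top page" with the origin at the bottom-left corner, while OpenCV counts y from the top. The sketch below isolates that conversion in a hypothetical helper, `boxes_to_opencv`, and checks it on a hand-written box line (not real OCR output) for an image assumed to be 100 pixels high.

```python
def boxes_to_opencv(boxes_text, img_height):
    """Convert image_to_boxes lines to (char, top_left, bottom_right) tuples
    in OpenCV coordinates (origin at the top-left corner)."""
    result = []
    for line in boxes_text.splitlines():
        parts = line.split(" ")
        if len(parts) < 5:
            continue  # skip empty or malformed lines
        char = parts[0]
        left, bottom, right, top = (int(v) for v in parts[1:5])
        top_left = (left, img_height - top)
        bottom_right = (right, img_height - bottom)
        result.append((char, top_left, bottom_right))
    return result


# Hand-written sample line for illustration only.
sample_boxes = "店 10 20 40 60 0"
print(boxes_to_opencv(sample_boxes, img_height=100))
```

The two corner points can be passed directly to cv2.rectangle as its second and third arguments.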