Configuring Python textract on Windows

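This article explains how to configure the Python textract library on Windows, including installing Tesseract OCR, adding Chinese language support, and handling PDF and .doc files. By setting the system Path and adding the required dependencies, such as pdftotext and antiword, you can extract text from files in different formats.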

Python textract helps you extract text from images and a variety of document formats.

Test environment:

1. Windows 7 or Windows 10 (64-bit)

2. Python 3.7 (64-bit)

3. test_image (the folder holding the sample files used in the tests)

 


Installing textract

pip install textract

Textract dependencies

A plain pip install textract is enough to extract text from docx, xlsx, and pptx files. Other formats need the extra components described below.
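Because the package name is easy to mistype, it can help to confirm that the module is importable before going further. The following is a minimal sketch using only the standard library; the helper name is_installed is illustrative and is not part of textract.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "textract" is the module name installed by "pip install textract".
    print("textract importable:", is_installed("textract"))
```

If this prints False, the install step most likely targeted a different Python interpreter or a different package name.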

  1. If you want textract to support OCR (optical character recognition), you need to install Tesseract:

https://github.com/tesseract-ocr/tesseract/wiki

For the Windows installer:

https://github.com/UB-Mannheim/tesseract/wiki

After installing tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe, add the installation path (C:\Program Files\Tesseract-OCR) to the system environment variable Path.

To support Chinese, download the trained data from:

https://github.com/tesseract-ocr/tessdata

and put the downloaded file chi_sim.traineddata in C:\Program Files\Tesseract-OCR\tessdata
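Before running OCR, it may be worth checking that both parts of this step took effect. The sketch below assumes the install directory given above; the function check_tesseract is a hypothetical helper, not something textract or Tesseract provides.

```python
import os
import shutil


def check_tesseract(install_dir, language="chi_sim"):
    """Report whether tesseract is on PATH and the language file is present."""
    traineddata = os.path.join(install_dir, "tessdata", language + ".traineddata")
    return {
        "executable_on_path": shutil.which("tesseract") is not None,
        "traineddata_found": os.path.isfile(traineddata),
    }


if __name__ == "__main__":
    # Install directory taken from the step above; adjust if yours differs.
    print(check_tesseract(r"C:\Program Files\Tesseract-OCR"))
```

A new command prompt is usually needed after editing Path, since windows that were already open keep the old value.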

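Test with the following commands:

import textract

# OCR a TIFF image with Tesseract, using the Simplified Chinese model
text = textract.process('./test_image/1.tif', method='tesseract', language='chi_sim')

# process() returns bytes, so decode before printing
print(text.decode('utf-8'))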

 

  2. If you want textract to support PDF, you need to download the pdftotext component from:

http://blog.alivate.com.au/poppler-windows/

The latest version at the time of writing is poppler-0.68.0_x86.

Unzip it to get the folder poppler-0.68.0, and put that folder in C:\Program Files (x86)\

Add the path (C:\Program Files (x86)\poppler-0.68.0\bin) to the system environment variable Path.
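If you would rather not edit the system variables, one alternative is to extend Path for the current Python process only, before calling textract. This is not part of the original setup and is offered as an assumption: it relies on external tools being launched as child processes that inherit the modified environment. The helper name prepend_to_path is hypothetical.

```python
import os


def prepend_to_path(directory):
    """Put a directory at the front of PATH for this process and its children."""
    current = os.environ.get("PATH", "")
    parts = current.split(os.pathsep) if current else []
    if directory not in parts:
        os.environ["PATH"] = os.pathsep.join([directory] + parts)
    return os.environ["PATH"]


if __name__ == "__main__":
    # Folder from the step above; adjust to wherever you unzipped poppler.
    prepend_to_path(r"C:\Program Files (x86)\poppler-0.68.0\bin")
```

The change disappears when the script exits, so it has to run at the start of every script that needs pdftotext.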

 

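Test with the following commands:

import textract

# Extract text from a PDF; textract calls pdftotext for this
text = textract.process('./test_image/1.pdf')

# process() returns bytes, so decode before printing
print(text.decode('utf-8'))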

 

  3. If you want to support doc files, you need to download the antiword component from:

https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Antiword.shtml

The latest version at the time of writing is antiword-0_37-windows.zip.

Unzip it to get the folder antiword. It must be placed directly under C:\ (the path appears to be hard-coded in the application).

Add the path (C:\antiword) to the system environment variable Path.

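Test with the following commands:

import subprocess

# Call antiword directly; -m selects the UTF-8 character mapping file
text = subprocess.check_output(['antiword', '-m', 'utf-8.txt', './test_image/1.doc'])

# check_output() returns bytes, so decode before printing
print(text.decode('utf-8'))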
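The three steps above can be summarised as a lookup from file extension to the external command it depends on. The sketch below covers only the extensions used in this article; the table REQUIRED_TOOL and the function missing_tool are illustrative names, not part of textract.

```python
import os
import shutil

# External command each file type relies on, per the steps above.
# None means a plain "pip install textract" is enough.
REQUIRED_TOOL = {
    ".docx": None,
    ".xlsx": None,
    ".pptx": None,
    ".tif": "tesseract",
    ".pdf": "pdftotext",
    ".doc": "antiword",
}


def missing_tool(filename):
    """Return the name of the tool that is needed but not on PATH, else None."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in REQUIRED_TOOL:
        raise ValueError("extension not covered in this article: " + extension)
    tool = REQUIRED_TOOL[extension]
    if tool is not None and shutil.which(tool) is None:
        return tool
    return None


if __name__ == "__main__":
    for name in ("1.tif", "1.pdf", "1.doc"):
        print(name, "missing:", missing_tool(name))
```

Running it before an extraction job gives a quick hint about which Path entry still needs attention.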
