Configuring Python textract on Windows

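Copyright notice: this is an original article by the blog author, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when reposting.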


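This article describes how to configure the Python textract library on Windows: installing Tesseract OCR, adding Chinese language support, converting PDFs, and handling doc files. Once the required dependencies such as pdftotext and antiword are installed and added to the system Path, textract can extract text from files in these formats.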


Python textract helps you extract text from images and from many document formats.

Test environment:

1. Windows 7 / Windows 10 (64-bit)

2. Python 3.7 (64-bit)


Installing textract

pip install textract

Textract dependencies

A plain pip install textract is enough to extract text from docx, xlsx, and pptx files.

For OCR (optical character recognition) support, you also need to install Tesseract:

https://github.com/tesseract-ocr/tesseract/wiki

Windows installer:

https://github.com/UB-Mannheim/tesseract/wiki

After running tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe, add the installation path (C:\Program Files\Tesseract-OCR) to the Path system variable.
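Before going further, it can help to confirm that the Path change actually took effect. The following is a minimal sketch that uses only the standard library; the helper name find_tools is made up for this illustration and is not part of textract.

```python
import shutil


def find_tools(names):
    """Return {name: full path or None} for each executable looked up on Path."""
    return {name: shutil.which(name) for name in names}


# Open a NEW terminal after editing Path, otherwise the old value is still in use.
for name, path in find_tools(["tesseract"]).items():
    print(f"{name}: {path if path else 'not found on Path'}")
```

The same helper can later be called with "pdftotext" and "antiword" once those components are installed.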

For Chinese support, download the trained data from:

https://github.com/tesseract-ocr/tessdata

and put the downloaded file chi_sim.traineddata in C:\Program Files\Tesseract-OCR\tessdata
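A quick way to check that the language file landed in the right folder is sketched below. It assumes the default install location mentioned above. The helper name has_language and the constant DEFAULT_TESSDATA are hypothetical names chosen for this example.

```python
from pathlib import Path

# Assumed default install location; adjust if Tesseract was installed elsewhere.
DEFAULT_TESSDATA = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def has_language(lang, tessdata_dir=DEFAULT_TESSDATA):
    """Return True if <lang>.traineddata exists in the tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


print("chi_sim installed:", has_language("chi_sim"))
```

If this prints False, the file is probably in a different folder or was saved under a different name.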

Test with the following commands:

import textract
# process() returns bytes, so decode the result before printing
text = textract.process('./test_image/1.tif', method='tesseract', language='chi_sim')
print(text.decode('utf-8'))

For PDF support, download the pdftotext component from:

http://blog.alivate.com.au/poppler-windows/

The latest version at the time of writing is poppler-0.68.0_x86.

Unzip it to get the folder poppler-0.68.0, and put that folder in C:\Program Files (x86)\

Add the path C:\Program Files (x86)\poppler-0.68.0\bin to the Path system variable.
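If editing the system variables is not an option (for example on a locked-down machine), an alternative is to extend Path only for the running Python process. This is a sketch under the assumption that textract launches pdftotext as a child process, which would inherit the modified environment. The helper name prepend_to_path is invented for this example.

```python
import os


def prepend_to_path(directory, env=os.environ):
    """Put `directory` at the front of PATH in `env` unless it is already listed."""
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


if __name__ == "__main__":
    # Assumed install location from the step above; call this before using textract.
    prepend_to_path(r"C:\Program Files (x86)\poppler-0.68.0\bin")
```

The change lasts only for the current process, so it has to run at the start of every script that needs pdftotext. The comparison is an exact string match, so the same folder written with different letter case would be added twice.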

Test with the following commands:

import textract
# process() returns bytes, so decode the result before printing
text = textract.process('./test_image/1.pdf')
print(text.decode('utf-8'))

For doc support, download the antiword component from:

https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Antiword.shtml

The latest version at the time of writing is antiword-0_37-windows.zip.

Unzip it to get the folder antiword, which must be placed directly under C:\ (the path appears to be hard-coded in the application).

Add the path C:\antiword to the Path system variable.

Test with the following commands:

import subprocess
# -m selects the character mapping file; check_output() returns bytes
text = subprocess.check_output(['antiword', '-m', 'utf-8.txt', './test_image/1.doc'])
print(text.decode('utf-8'))
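The antiword call above fails with a bare traceback when the executable is missing or the file cannot be read. The sketch below wraps it so that both cases produce a readable message. The function names antiword_command and doc_to_text are hypothetical, and the error handling shown is one possible choice rather than anything textract itself does.

```python
import subprocess


def antiword_command(doc_path, mapping="utf-8.txt"):
    """Build the antiword argument list; -m selects the character mapping file."""
    return ["antiword", "-m", mapping, str(doc_path)]


def doc_to_text(doc_path):
    """Run antiword and return the decoded text, raising RuntimeError on failure."""
    try:
        raw = subprocess.check_output(
            antiword_command(doc_path), stderr=subprocess.PIPE
        )
    except FileNotFoundError as exc:
        # Raised when the antiword executable itself cannot be located.
        raise RuntimeError("antiword was not found on Path") from exc
    except subprocess.CalledProcessError as exc:
        # Raised when antiword exits with a non-zero status.
        detail = exc.stderr.decode("utf-8", errors="replace").strip()
        raise RuntimeError(f"antiword failed: {detail}") from exc
    return raw.decode("utf-8")
```

A call such as doc_to_text('./test_image/1.doc') would then return a str instead of bytes.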
