python操作PDF文件(转文本、图片)

本文介绍了一种将PDF文件批量转换为JPG图片的方法,利用wand库进行转换,并解决了Ubuntu环境下遇到的Imagick-ImagickException错误,通过修改policy.xml文件配置实现。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

最近在批量处理pdf文件。由于pdf中的格式比较多,所以在处理上有一定的难度。
我这里的思路是:如果能分析出文本就提取文本,如果不能分析出文本就转为jpg。(因为后面我还要处理jpg文件,所以也不在乎多一些)
因为之前提到过处理pdf文本的,所以这里就不在赘述。就说说把pdf中每页的内容抓为jpg格式吧。
这里因为要处理图片,所以用到了wand这个库

安装:pip3 install wand

然后就可以编写脚本来转换

from wand.image import Image
# 将pdf文件转为jpg图片文件
# ./PDF_FILE_NAME 为pdf文件路径和名称
def pdf2jpg(filename):
    image_pdf = Image(filename=filename,resolution=300)
    image_jpeg = image_pdf.convert('jpg')
    # wand已经将PDF中所有的独立页面都转成了独立的二进制图像对象。我们可以遍历这个大对象,并把它们加入到req_image序列中去。
    req_image = []
    for img in image_jpeg.sequence:
        img_page = Image(image=img)
        req_image.append(img_page.make_blob('jpg'))
    # 遍历req_image,保存为图片文件
    i = 0
    for img in req_image:
        ff = open(str(i)+'.jpg','wb')
        ff.write(img)
        ff.close()
        i += 1

filename='/home/filename.pdf'
pdf2jpg(filename)

然后运行,嘿嘿,报错:Imagick - ImagickException not… @ error / construct.c / ReadImage / 412错误
然后再百度了一下,我这边是ubuntu,cd到/etc/ImageMagick-6
然后用vi 打开policy.xml
编辑内容如下(其实与它里面原本的内容并无多大区别,主要是最后中的内容)
然后保存退出,再次运行,啦啦啦啦,prefect!!!

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE policymap [
<!ELEMENT policymap (policy)+>
<!ELEMENT policy (#PCDATA)>
<!ATTLIST policy domain (delegate|coder|filter|path|resource) #IMPLIED>
<!ATTLIST policy name CDATA #IMPLIED>
<!ATTLIST policy rights CDATA #IMPLIED>
<!ATTLIST policy pattern CDATA #IMPLIED>
<!ATTLIST policy value CDATA #IMPLIED>
]>
<!--
  Configure ImageMagick policies.

  Domains include system, delegate, coder, filter, path, or resource.

  Rights include none, read, write, and execute.  Use | to combine them,
  for example: "read | write" to permit read from, or write to, a path.

  Use a glob expression as a pattern.

  Suppose we do not want users to process MPEG video images:

    <policy domain="delegate" rights="none" pattern="mpeg:decode" />

  Here we do not want users reading images from HTTP:

    <policy domain="coder" rights="none" pattern="HTTP" />

  Lets prevent users from executing any image filters:

    <policy domain="filter" rights="none" pattern="*" />

  The /repository file system is restricted to read only.  We use a glob
  expression to match all paths that start with /repository:

    <policy domain="path" rights="read" pattern="/repository/*" />

  Any large image is cached to disk rather than memory:

    <policy domain="resource" name="area" value="1GB"/>

  Define arguments for the memory, map, area, and disk resources with
  SI prefixes (.e.g 100MB).  In addition, resource policies are maximums for
  each instance of ImageMagick (e.g. policy memory limit 1GB, -limit 2GB
  exceeds policy maximum so memory limit is 1GB).
-->
<policymap>
  <!-- <policy domain="resource" name="temporary-path" value="/tmp"/> -->
  <!-- <policy domain="resource" name="memory" value="2GiB"/> -->
  <!-- <policy domain="resource" name="map" value="4GiB"/> -->
  <!-- <policy domain="resource" name="area" value="1GB"/> -->
  <!-- <policy domain="resource" name="disk" value="16EB"/> -->
  <!-- <policy domain="resource" name="file" value="768"/> -->
  <!-- <policy domain="resource" name="thread" value="4"/> -->
  <!-- <policy domain="resource" name="throttle" value="0"/> -->
  <!-- <policy domain="resource" name="time" value="3600"/> -->
  <!-- <policy domain="system" name="precision" value="6"/> -->
  <policy domain="cache" name="shared-secret" value="passphrase"/>
  <policy domain="coder" rights="none" pattern="EPHEMERAL" />
  <policy domain="coder" rights="none" pattern="URL" />
  <policy domain="coder" rights="none" pattern="HTTPS" />
  <policy domain="coder" rights="none" pattern="MVG" />
  <policy domain="coder" rights="none" pattern="MSL" />
  <policy domain="coder" rights="none" pattern="TEXT" />
  <policy domain="coder" rights="none" pattern="SHOW" />
  <policy domain="coder" rights="none" pattern="WIN" />
  <policy domain="coder" rights="none" pattern="PLT" />
  <policy domain="path" rights="none" pattern="@*" />
</policymap> 
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

Mick Schumacher

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值