Batch-Deleting Specific Images from a PDF with Python

Humbunklung

Article link: https://blog.youkuaiyun.com/Humbunklung/article/details/137470062

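Some PDF files downloaded from the web have specific background images added to them. When a file runs to many pages, removing these by hand is impractical, so we need a program that deletes them in bulk.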

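I. Required Package

MuPDF is a solid PDF processing library. Its Python binding is called PyMuPDF, and it is a remarkably capable tool. The official documentation is here:

https://pymupdf.readthedocs.io/en/latest/

Install it with pip: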

pip install --upgrade pymupdf

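If the library is already present in the local Python environment, this command upgrades it to the latest version.

II. Importing the Package

Import the package with an import statement: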

import fitz
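As far as I know, recent PyMuPDF releases can also be imported under the name pymupdf, with fitz kept as the legacy name; treat that as an assumption and check it against your installed version. A minimal sketch of an import that tolerates both names, plus a version comparison; load_pymupdf, parse_version and installed_version are hypothetical helper names, not part of the library:

```python
import importlib
import re


def load_pymupdf():
    """Return the PyMuPDF module under whichever name is installed, or None."""
    for name in ("pymupdf", "fitz"):  # newer name first, legacy name second
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


def parse_version(text):
    """Turn a version string such as '1.24.1' into the tuple (1, 24, 1)."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def installed_version():
    """Return the installed binding version as a tuple, or None if unavailable."""
    mod = load_pymupdf()
    version = getattr(mod, "version", None)  # (binding, MuPDF, timestamp) tuple
    if not version:
        return None
    return parse_version(version[0])
```

Tuples compare element by element, so parse_version("1.24.1") >= (1, 18, 18) is a safe way to require a minimum version.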

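III. Finding and Deleting Specific Images

1. Finding what to delete

According to the official PyMuPDF documentation, an image can be removed from a PDF with the image deletion method of the Page class. The documentation notes that this actually replaces the original image with a blank stand-in rather than truly deleting it. The image is no longer visible to the eye, but an object editing tool will still show a transparent box: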

delete_image(xref)

For a more thorough, root-level fix, there is the _deleteObject method of the Document class, which removes the referenced image object entirely:

doc._deleteObject(xref)

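Both methods take xref, the cross-reference number of the image, for example 622. We can obtain these xref numbers by listing the images in the PDF. For example, the following program extracts the images from a PDF and writes each one to a file named after its xref.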

import os
import sys
import time

import fitz

print(fitz.__doc__)

if not tuple(map(int, fitz.version[0].split("."))) >= (1, 18, 18):
    raise SystemExit("require PyMuPDF v1.18.18+")

dimlimit = 0  # 100  # each image side must be greater than this
relsize = 0  # 0.05  # stored size : raw pixel size ratio must be larger than this (5%)
abssize = 0  # 2048  # absolute image size limit 2 KB: ignore if smaller
imgdir = "output"  # found images are stored in this subfolder

if not os.path.exists(imgdir):  # make subfolder if necessary
    os.mkdir(imgdir)


def recoverpix(doc, item):
    xref = item[0]  # xref of PDF image
    smask = item[1]  # xref of its /SMask

    # special case: /SMask or /Mask exists
    if smask > 0:
        pix0 = fitz.Pixmap(doc.extract_image(xref)["image"])
        if pix0.alpha:  # catch irregular situation
            pix0 = fitz.Pixmap(pix0, 0)  # remove alpha channel
        mask = fitz.Pixmap(doc.extract_image(smask)["image"])

        try:
            pix = fitz.Pixmap(pix0, mask)
        except Exception:  # fallback to original base image in case of problems
            pix = fitz.Pixmap(doc.extract_image(xref)["image"])

        if pix0.n > 3:
            ext = "pam"
        else:
            ext = "png"

        return {  # create dictionary expected by caller
            "ext": ext,
            "colorspace": pix.colorspace.n,
            "image": pix.tobytes(ext),
        }

    # special case: /ColorSpace definition exists
    # to be sure, we convert these cases to RGB PNG images
    if "/ColorSpace" in doc.xref_object(xref, compressed=True):
        pix = fitz.Pixmap(doc, xref)
        pix = fitz.Pixmap(fitz.csRGB, pix)
        return {  # create dictionary expected by caller
            "ext": "png",
            "colorspace": 3,
            "image": pix.tobytes("png"),
        }
    return doc.extract_image(xref)


# take the PDF path from the command line, else fall back to a default path
fname = sys.argv[1] if len(sys.argv) == 2 else "your_default_pdf_path"
if not fname:
    raise SystemExit("no PDF file given")

t0 = time.time()
doc = fitz.open(fname)

page_count = doc.page_count  # number of pages

xreflist = []
imglist = []
for pno in range(page_count):

    il = doc.get_page_images(pno)
    imglist.extend([x[0] for x in il])
    for img in il:
        xref = img[0]
        if xref in xreflist:
            continue
        width = img[2]
        height = img[3]
        if min(width, height) <= dimlimit:
            continue
        image = recoverpix(doc, img)
        n = image["colorspace"]
        imgdata = image["image"]

        if len(imgdata) <= abssize:
            continue
        if len(imgdata) / (width * height * n) <= relsize:
            continue

        imgfile = os.path.join(imgdir, "img%06i.%s" % (xref, image["ext"]))
        with open(imgfile, "wb") as fout:  # closes the file even on errors
            fout.write(imgdata)
        xreflist.append(xref)

t1 = time.time()
imglist = list(set(imglist))  # drop xrefs that appear on several pages
print(len(imglist), "images in total")
print(len(xreflist), "images extracted")
print("total time %g sec" % (t1 - t0))

The program prints the following:

PyMuPDF 1.24.1: Python bindings for the MuPDF 1.24.1 library (rebased implementation).

Python 3.11 running on win32 (64-bit).

12 images in total

12 images extracted

total time 1.93884 sec

Looking at the images in the output directory, I need to delete the images with xref 503 (a string of code) and 504 (my email address).
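Picking the xref by eye works for a handful of images. An image stamped onto every page is normally referenced from every page, so a page-usage count can narrow the candidates first. The sketch below relies on that heuristic, which is my assumption and not something the PyMuPDF documentation states. pages_by_xref, on_every_page and scan_pdf are hypothetical names, and only the first field of each get_page_images entry (the xref) is used:

```python
from collections import defaultdict


def pages_by_xref(items_per_page):
    """Map each image xref to the sorted page numbers that reference it.

    items_per_page yields (page_number, image_items); each image item is a
    tuple whose first field is the xref, as in doc.get_page_images(pno).
    """
    usage = defaultdict(set)
    for pno, items in items_per_page:
        for item in items:
            usage[item[0]].add(pno)
    return {xref: sorted(pages) for xref, pages in usage.items()}


def on_every_page(usage, page_count):
    """Return the xrefs referenced by all pages: likely watermark candidates."""
    return sorted(x for x, pages in usage.items() if len(pages) == page_count)


def scan_pdf(path):
    """Collect image usage from a real PDF (needs PyMuPDF installed)."""
    import fitz  # imported lazily so the helpers above work without it

    doc = fitz.open(path)
    try:
        pairs = [(pno, doc.get_page_images(pno)) for pno in range(doc.page_count)]
        return pages_by_xref(pairs), doc.page_count
    finally:
        doc.close()
```

Calling scan_pdf("your pdf path") and passing its two results to on_every_page gives a short list to check against the extracted files, which the extraction script names after the xref (img000503 and img000504 for the two images here).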

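2. Deleting what we want to delete

Once the target images are identified, the rest is simple: iterate over every Page and delete the corresponding images.

Fake deletion (replacement only):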

import fitz

doc = fitz.open("your pdf path")
for page in doc:
    # replace the specific images with a transparent stand-in
    page.delete_image(503)
    page.delete_image(504)
# save the document under a new name
doc.save("tapmh.pdf")
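The loop above can be wrapped into a small reusable function. This is a sketch under two assumptions: that Page.delete_image acts on the whole document (the PyMuPDF documentation describes the replacement as global), so one call per xref is enough, and that Document.xref_is_image is available to reject ids that are not images. parse_xrefs and blank_images are hypothetical names:

```python
def parse_xrefs(text):
    """Parse '503,504' or '503 504' into a de-duplicated list of ints, order kept."""
    seen = []
    for token in text.replace(",", " ").split():
        value = int(token)
        if value <= 0:
            raise ValueError("xref must be positive: %r" % token)
        if value not in seen:
            seen.append(value)
    return seen


def blank_images(src, dst, xrefs):
    """Replace each listed image with a transparent stand-in and save to dst."""
    import fitz  # lazy import: parse_xrefs stays usable without PyMuPDF

    doc = fitz.open(src)
    try:
        if doc.page_count == 0:
            raise ValueError("document has no pages")
        page = doc[0]  # any page serves as the entry point (assumption, see above)
        for xref in xrefs:
            if not doc.xref_is_image(xref):
                raise ValueError("xref %d is not an image" % xref)
            page.delete_image(xref)
        doc.save(dst)
    finally:
        doc.close()
```

A call such as blank_images("your pdf path", "tapmh.pdf", parse_xrefs("503,504")) would then mirror the script above while failing early on a mistyped id.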

Real deletion (removes the objects outright):

import fitz

doc = fitz.open("your pdf path")
# remove the xref objects (_deleteObject is an underscore-prefixed, internal method)
doc._deleteObject(503)
doc._deleteObject(504)
# save the document
doc.save("tapmh.pdf")
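Because _deleteObject carries a leading underscore, it is presumably internal and may change or disappear between releases. A cautious sketch, assuming Document.xref_length() returns the size of the xref table so that usable ids run from 1 to that value minus one; valid_xrefs and hard_delete are hypothetical names:

```python
def valid_xrefs(xrefs, xref_length):
    """Keep only ids inside the usable xref range 1 .. xref_length - 1."""
    return [x for x in xrefs if 1 <= x < xref_length]


def hard_delete(src, dst, xrefs):
    """Delete the listed objects outright and return the ids that were removed."""
    import fitz  # lazy import: valid_xrefs stays usable without PyMuPDF

    doc = fitz.open(src)
    try:
        delete = getattr(doc, "_deleteObject", None)
        if delete is None:  # internal API, may be missing in other versions
            raise RuntimeError("this PyMuPDF build has no _deleteObject")
        targets = valid_xrefs(xrefs, doc.xref_length())
        for xref in targets:
            delete(xref)
        doc.save(dst)
        return targets
    finally:
        doc.close()
```

Saving to a new file name, as the article does, keeps the original PDF available in case the wrong object was removed.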

The result after deletion: