Merging Multiple PDF Files with Python


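Suppose you have the boring job of merging several dozen PDF documents into a single PDF file. Each of them has a cover sheet as its first page, but you don't want the cover sheet repeated in the final result. There are lots of free programs for combining PDFs, but many of them simply merge entire files together. Let's write a Python program that lets you choose which pages go into the combined PDF. At a high level, here is what the program will do:

Find all PDF files in the current working directory.
Sort the filenames so the PDFs are added in order.
Write each page (except the first) of each PDF to the output file.

In terms of implementation, your code will need to do the following:

Call os.listdir() to find all the files in the working directory and remove any non-PDF files.
Call Python's sort() list method to alphabetize the filenames.
Create a PdfFileWriter object for the output PDF.
Loop over each PDF file, creating a PdfFileReader object for it.
Loop over each page (except the first) in each PDF file.
Add the pages to the output PDF.
Write the output PDF to a file named allminutes.pdf.

For this project, open a new file editor window and save it as combinePdfs.py.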
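To see why the sorting step passes a key, here is a small sketch with made-up filenames (they are hypothetical, not part of the project). A plain sort compares characters by code point, so every name starting with an uppercase letter lands ahead of the lowercase ones; key=str.lower compares the names case-insensitively instead.

```python
# Hypothetical filenames, for illustration only.
names = ['minutes_b.pdf', 'Minutes_c.pdf', 'minutes_a.pdf']

# Default ordering: uppercase letters sort before lowercase ones.
plain = sorted(names)

# Case-insensitive ordering, using the same key the program passes to sort().
folded = sorted(names, key=str.lower)

print(plain)   # ['Minutes_c.pdf', 'minutes_a.pdf', 'minutes_b.pdf']
print(folded)  # ['minutes_a.pdf', 'minutes_b.pdf', 'Minutes_c.pdf']
```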

Step 1: Find All the PDF Files

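First, your program needs to get a list of all files with the .pdf extension in the current working directory and sort them. Make your code look like the following:

#! /usr/bin/python3
# combinePdfs.py - Combines all the PDFs in the current working directory into
# a single PDF.

import PyPDF2, os

# Get all the PDF filenames.
pdfFiles = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf'):
        pdfFiles.append(filename)
pdfFiles.sort(key = str.lower)

pdfWriter = PyPDF2.PdfFileWriter()

# TODO: Loop through all the PDF files.

# TODO: Loop through all the pages (except the first) and add them.

# TODO: Save the resulting PDF to a file.

After the shebang line and the descriptive comment about what the program does, this code imports the os and PyPDF2 modules. The os.listdir('.') call returns a list of every file in the current working directory. The code loops over this list and adds only the files with the .pdf extension to pdfFiles. Afterward, the list is sorted in alphabetical order by passing the key = str.lower keyword argument to sort(). A PdfFileWriter object is created to hold the combined PDF pages. Finally, a few comments outline the rest of the program. Note that the PdfFileReader and PdfFileWriter names used throughout belong to PyPDF2 releases before 3.0; from 3.0 on the classes are called PdfReader and PdfWriter.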
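One detail worth knowing about the listing above: endswith('.pdf') is case-sensitive, so a file named REPORT.PDF would be skipped, while a directory whose name happens to end in .pdf would be kept. The following is a minimal sketch of a more forgiving filter using pathlib; the function name find_pdfs is an assumption made for this example and does not appear in the original program.

```python
from pathlib import Path


def find_pdfs(directory='.'):
    """Return the PDF filenames in `directory`, sorted case-insensitively."""
    names = [
        entry.name
        for entry in Path(directory).iterdir()
        # Keep regular files only, and compare the extension in lowercase
        # so that .pdf, .PDF and .Pdf are all accepted.
        if entry.is_file() and entry.suffix.lower() == '.pdf'
    ]
    names.sort(key=str.lower)
    return names


if __name__ == '__main__':
    print(find_pdfs('.'))
```

The result can be assigned to pdfFiles in place of the loop over os.listdir('.').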

Step 2: Open Each PDF File

Now the program must read each PDF file in pdfFiles. Add the following to your program:

#! /usr/bin/python3
# combinePdfs.py - Combines all the PDFs in the current working directory into
# a single PDF.

import PyPDF2, os

# Get all the PDF filenames.
pdfFiles = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf'):
        pdfFiles.append(filename)
pdfFiles.sort(key = str.lower)

pdfWriter = PyPDF2.PdfFileWriter()

# Loop through all the PDF files.
for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # TODO: Loop through all the pages (except the first) and add them.

# TODO: Save the resulting PDF to a file.

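For each PDF, the loop opens the file in read-binary mode by calling open() with 'rb' as the second argument. The open() call returns a File object, which is passed to PyPDF2.PdfFileReader() to create a PdfFileReader object for that PDF file.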
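The loop above never closes the files it opens. Closing each one right away is risky, because PyPDF2 may still need to read page data from the source file when the output is finally written. One possible way to handle this, sketched here with the standard library only, is to register every file on a contextlib.ExitStack so that all of them stay open until the output has been written and are then closed together. The helper name open_all is an assumption made for this example.

```python
from contextlib import ExitStack


def open_all(stack, filenames):
    """Open every file in binary read mode and register it on `stack`.

    The files stay open until the enclosing `with ExitStack()` block ends.
    """
    return [stack.enter_context(open(name, 'rb')) for name in filenames]


# Usage sketch (pdfFiles and PyPDF2 come from the program above):
#
# with ExitStack() as stack:
#     for pdfFileObj in open_all(stack, pdfFiles):
#         pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#         ...  # copy pages to the writer
#     ...      # write the output here, while the inputs are still open
```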

Step 3: Add Each Page

For each PDF, you'll want to loop over every page except the first. Add this code to your program:

#! /usr/bin/python3
# combinePdfs.py - Combines all the PDFs in the current working directory into
# a single PDF.

import PyPDF2, os

# Get all the PDF filenames.
pdfFiles = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf'):
        pdfFiles.append(filename)
pdfFiles.sort(key = str.lower)

pdfWriter = PyPDF2.PdfFileWriter()

# Loop through all the PDF files.
for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # Loop through all the pages (except the first) and add them.
    for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)



# TODO: Save the resulting PDF to a file.

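The code inside the for loop copies each Page object individually to the PdfFileWriter object. Remember, you want to skip the first page. Since PyPDF2 considers 0 to be the first page, your loop should start at 1 and then go up to, but not include, the integer in pdfReader.numPages.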
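The index arithmetic can be checked without any PDF at all. In this small sketch, body_page_numbers is a hypothetical helper (not part of the program) that returns the zero-based page indexes the loop would visit for a document with a given page count.

```python
def body_page_numbers(num_pages, skip=1):
    """Zero-based indexes of the pages left after skipping the first `skip`."""
    return list(range(skip, num_pages))


print(body_page_numbers(4))  # four-page PDF        -> [1, 2, 3]
print(body_page_numbers(1))  # cover page only      -> []
```

A PDF that consists of nothing but a cover page therefore contributes no pages to the output, and the loop body simply never runs for it.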

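Step 4: Save the Results

After these nested for loops are done looping, the pdfWriter variable will contain a PdfFileWriter object with the pages of all the PDFs combined. The last step is to write this content to a file on the hard drive. Add this code to your program: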

#!/usr/bin/python3
# combinePdfs.py - Combines all the PDFs in the current working directory into
# a single PDF.

import PyPDF2, os

# Get all the PDF filenames.
pdfFiles = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf'):
        pdfFiles.append(filename)
pdfFiles.sort(key = str.lower)

pdfWriter = PyPDF2.PdfFileWriter()

# Loop through all the PDF files.
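for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    # strict=False keeps some correctable PDF problems from being fatal.
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
    # Loop through all the pages (except the first) and add them.
    for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)

# Save the resulting PDF to a file.
pdfOutput = open('allminutes.pdf', 'wb')
pdfWriter.write(pdfOutput)
pdfOutput.close()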
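The listing above targets PyPDF2 releases before 3.0. In PyPDF2 3.x, PdfFileReader, PdfFileWriter, numPages, getPage() and addPage() were replaced by PdfReader, PdfWriter, len(reader.pages), reader.pages[i] and add_page(). The following is a rough sketch of the same merge with the newer names, assuming PyPDF2 3.x is installed. The function names append_body_pages and merge_without_covers are hypothetical and chosen only for this example.

```python
def append_body_pages(reader, writer, skip=1):
    """Copy every page after the first `skip` pages from reader to writer.

    `reader` needs a `pages` sequence and `writer` an `add_page()` method,
    which is what PdfReader and PdfWriter provide in PyPDF2 3.x.
    """
    for index in range(skip, len(reader.pages)):
        writer.add_page(reader.pages[index])


def merge_without_covers(filenames, output_name):
    """Merge the given PDFs into output_name, dropping each first page."""
    # Imported here so append_body_pages can be used without the library.
    # Assumption: PyPDF2 3.x is installed.
    from PyPDF2 import PdfReader, PdfWriter

    writer = PdfWriter()
    for name in filenames:
        append_body_pages(PdfReader(name), writer)
    # The with block closes the output file even if write() raises.
    with open(output_name, 'wb') as output:
        writer.write(output)
```

With the pdfFiles list from Step 1, the call would be merge_without_covers(pdfFiles, 'allminutes.pdf').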