I. Installing the required libraries
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfplumber
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pypdf2
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfminer3k
1. The PyPDF family:
PyPDF2, PyPDF3, and PyPDF4 are mainly used to manipulate PDFs: merging, splitting, and rotating pages.
A typical test snippet:
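# Written against the legacy PyPDF2 API (before 3.0). PyPDF2 3.x renamed
# PdfFileReader to PdfReader and replaced getDocumentInfo()/getNumPages()
# with the .metadata attribute and len(reader.pages).
from PyPDF2 import PdfFileReader as reader, PdfFileWriter as writer

pdfpath = 'D:/A_myfile/deep learning by shu008.pdf'
with open(pdfpath, 'rb') as f:
    pdf = reader(f)
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()
    # Build the report while the file is still open
    txt = f'''{information} information:
Author : {information.author},
Creator : {information.creator},
Producer : {information.producer},
Subject : {information.subject},
Title : {information.title},
Number of pages : {number_of_pages}
'''
    print(txt)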
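Metadata entries are optional in a PDF, so some of the fields printed above may be missing. The following is a minimal sketch of a report that tolerates missing entries. It assumes the document information behaves like a plain dictionary whose keys carry a leading slash (for example '/Author'); the helper name `format_info` and the sample values are made up for illustration.

```python
def format_info(info, number_of_pages):
    """Format a PDF document-information dictionary as readable text."""
    fields = ["Author", "Creator", "Producer", "Subject", "Title"]
    # Assumption: info keys are written with a leading slash, e.g. '/Author'
    lines = [f"{name} : {info.get('/' + name, 'n/a')}" for name in fields]
    lines.append(f"Number of pages : {number_of_pages}")
    return "\n".join(lines)


# Hypothetical metadata; a real file may lack some of these entries
sample_info = {"/Author": "Jane Doe", "/Title": "Sample Title"}
report = format_info(sample_info, 120)
print(report)
```

Fields that are absent from the dictionary are shown as "n/a" instead of None.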
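2. pdfplumber
Extracts detailed information about every text character, rectangle, and line on each page of a PDF, and also supports table extraction and visual debugging.
A typical test snippet: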
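import pandas as pd
import pdfplumber

path = 'D:/A_myfile/deep learning by shu008.pdf'
pdf = pdfplumber.open(path)
for page in pdf.pages:
    # Other ways to read the current page:
    # print(page.extract_text())    # text only, including text inside tables (table rows are merged into plain lines)
    # print(page.extract_words())   # words, with their text and coordinates
    # print(page.extract_tables())  # tables as lists of rows, without coordinates
    # print(page.chars)             # individual characters (not words), with text and coordinates
    for t in page.extract_tables():
        # for row in t:
        #     print(row)
        # Each table is a nested list; a DataFrame is easier to inspect and analyze
        df = pd.DataFrame(t[1:], columns=t[0])
        print(df)
    # Only test on the first page
    break
pdf.close()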
As the pdfplumber README notes, it works best on machine-generated rather than scanned PDFs, and it is built on pdfminer.six.
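To make the DataFrame conversion in the snippet above more concrete, here is a minimal pandas-free sketch of the same idea. It assumes a table has the shape used in that snippet (a list of rows whose first row is the header) and that empty cells may come back as None; the helper name `table_to_records` and the sample table are made up for illustration.

```python
def table_to_records(table):
    """Turn a nested-list table (first row = header) into a list of dicts."""
    if not table:
        return []
    # Replace missing header cells with a positional placeholder name
    header = [cell if cell is not None else f"col{i}"
              for i, cell in enumerate(table[0])]
    records = []
    for row in table[1:]:
        # Replace missing cells with an empty string
        cleaned = ["" if cell is None else cell for cell in row]
        records.append(dict(zip(header, cleaned)))
    return records


# Hypothetical table, shaped like one element of page.extract_tables()
sample_table = [
    ["Chapter", "Page"],
    ["Introduction", "1"],
    ["Linear Algebra", None],
]
records = table_to_records(sample_table)
for record in records:
    print(record)
```

Cleaning the None cells first also keeps later string operations on the cells from failing.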
3. pdfminer3k
pdfminer is a Python 2 library, and pdfminer3k is its Python 3 counterpart. Compared with pdfplumber it is cumbersome to use, so I have not needed it so far.
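import pdfminer
dir(pdfminer)  # list the names defined in the package
pdfminer?      # IPython/Jupyter only: show the docstring
pdfminer??     # IPython/Jupyter only: show the source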
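II. Main code
1. Extracting the relevant information with pdfplumber
The code that collects the table-of-contents entries and their page numbers has to be adapted to each PDF. In this file, every heading is followed by the string 'Chen Gong', which serves as the marker.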
import pdfplumber

pdfpath = 'D:/A_myfile/deep learning by shu008.pdf'
cata_list = []
with pdfplumber.open(pdfpath) as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ''  # extract the text; guard against pages with no text
        if 'Chen Gong' in text:
            # 1. Heading: everything before the marker string
            title = text.partition('Chen Gong')[0]
            # 2. Page on which the heading appears (1-based)
            page_num = page.page_number
            cata_list.append((title, page_num))
print(cata_list)
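The matching logic can be tried without a PDF at hand. The sketch below assumes the page texts have already been extracted into a list, one entry per page, with None for pages that have no text. The helper name `collect_headings` and the sample texts are hypothetical.

```python
def collect_headings(page_texts, marker):
    """Return (heading, page_number) pairs for pages whose text contains marker.

    page_texts holds one entry per page, in order; page numbers are 1-based.
    """
    headings = []
    for page_number, text in enumerate(page_texts, start=1):
        if not text or marker not in text:
            continue  # skip empty pages and pages without the marker
        # Everything before the first occurrence of the marker is the heading
        heading = text.partition(marker)[0].strip()
        headings.append((heading, page_number))
    return headings


# Hypothetical page texts standing in for page.extract_text() results
sample_pages = [
    "Chapter 1 Introduction\nChen Gong\nBody text ...",
    "More body text without the marker",
    None,
    "Chapter 2 Linear Algebra\nChen Gong\nBody text ...",
]
headings = collect_headings(sample_pages, "Chen Gong")
print(headings)
```

The sketch strips surrounding whitespace from each heading right away, which is a small design choice of this example.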
2. addBookmark
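# Legacy PyPDF2 API (before 3.0). PyPDF2 3.x renamed addPage/addBookmark
# to add_page/add_outline_item.
from PyPDF2 import PdfFileReader as reader, PdfFileWriter as writer

pdfpath = 'D:/A_myfile/deep learning by shu008.pdf'
pdf_in = reader(pdfpath)
pdf_out = writer()
pageCount = pdf_in.getNumPages()
# print(pageCount)
# Copy every page into the output document
for iPage in range(pageCount):
    pdf_out.addPage(pdf_in.getPage(iPage))
# cata_list comes from the pdfplumber step above
for title, page in cata_list:
    page_name = title[:-1]  # heading text without its last character (presumably the trailing newline)
    page_num = int(page)    # 1-based page number
    pdf_out.addBookmark(page_name, page_num - 1, None)  # the bookmark's page index is 0-based
outpath = 'D:/A_myfile/shu008目录版.pdf'  # "目录版" means "version with table of contents"
with open(outpath, 'wb') as fout:
    pdf_out.write(fout)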
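The page numbers collected earlier are 1-based, while addBookmark expects a 0-based page index, which is why the script subtracts 1. Below is a minimal sketch of preparing and checking the bookmark arguments before writing anything. The helper name `bookmark_targets` and the sample entries are hypothetical.

```python
def bookmark_targets(cata_list, page_count):
    """Convert (title, 1-based page) pairs into (title, 0-based index) pairs."""
    targets = []
    for title, page_num in cata_list:
        index = int(page_num) - 1  # page numbers may arrive as strings
        if not 0 <= index < page_count:
            raise ValueError(f"page {page_num} is outside 1..{page_count}")
        targets.append((title.strip(), index))
    return targets


# Hypothetical entries in the same shape as cata_list
sample_cata_list = [
    ("Chapter 1 Introduction\n", "1"),
    ("Chapter 2 Linear Algebra\n", 4),
]
targets = bookmark_targets(sample_cata_list, page_count=10)
print(targets)
```

Each resulting pair could then be passed on as pdf_out.addBookmark(title, index, None).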
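References: 1. 用python依据txt文档给pdf添加目录以及超链接 (Adding a table of contents and hyperlinks to a PDF from a txt file with Python), Zhihu
2. 最强的Python 办公自动化之 PDF 攻略来了(全) (The ultimate Python office-automation guide to PDFs, complete edition), 菜鸟学Python's blog, CSDN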