Introduction to the pdfminer API: scraping PDFs from the web

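This post covers how to install and use PDFMiner, and walks through examples of extracting text and other content from PDF documents, both from the command line and from Python code.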

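  Installation: pip install pdfminer

  Scraping data is the first stage of a data analysis project. Some of that data is published as PDF files, which have to be parsed after they are downloaded; pdfminer is the tool used for that here.

  First, what is pdfminer?

  The official description reads: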

      PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.

   Two examples cover most of its usage.

  Example 1:

$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
(extract text as an HTML file whose filename is output.html)

$ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
(extract a Japanese HTML file in vertical writing, CMap is required)

$ pdf2txt.py -P mypassword -o output.txt secret.pdf
(extract a text from an encrypted PDF file)

  Options:

-o filename
    Specifies the output file name. By default, it prints the extracted contents to stdout in text format.

-p pageno[,pageno,...]
    Specifies the comma-separated list of the page numbers to be extracted. Page numbers start at one. By default, it extracts text from all the pages.

-c codec
    Specifies the output codec.

-t type
    Specifies the output format. The following formats are currently supported.

        text : TEXT format. (Default)
        html : HTML format. Not recommended for extraction purposes because the markup is messy.
        xml : XML format. Provides the most information.
        tag : "Tagged PDF" format. A tagged PDF has its own contents annotated with HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations. Tags used here are defined in the PDF specification (See §10.7 "Tagged PDF"). 

-I image_directory
    Specifies the output directory for image extraction. Currently only JPEG images are supported.

-M char_margin 
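
In a scraping pipeline it is often more convenient to call pdf2txt.py from Python than by hand. The sketch below only assembles an argument list from the options documented above (-o, -p, -c, -t, -P). The helper names `build_pdf2txt_command` and `run_pdf2txt` and their parameters are made up for illustration, and the code assumes `pdf2txt.py` is installed and on the PATH.

```python
import subprocess


def build_pdf2txt_command(pdf_path, output=None, pages=None, codec=None,
                          fmt=None, password=None):
    """Assemble a pdf2txt.py argument list (hypothetical helper).

    Only the flags described above are used: -o, -p, -c, -t and -P.
    """
    cmd = ["pdf2txt.py"]
    if password is not None:
        cmd += ["-P", password]
    if pages:
        # Page numbers are 1-based on the command line
        if any(p < 1 for p in pages):
            raise ValueError("page numbers start at one")
        cmd += ["-p", ",".join(str(p) for p in pages)]
    if codec is not None:
        cmd += ["-c", codec]
    if fmt is not None:
        if fmt not in ("text", "html", "xml", "tag"):
            raise ValueError("unsupported output format: %s" % fmt)
        cmd += ["-t", fmt]
    if output is not None:
        cmd += ["-o", output]
    cmd.append(pdf_path)
    return cmd


def run_pdf2txt(pdf_path, **options):
    """Run pdf2txt.py; assumes the script is installed and on the PATH."""
    return subprocess.check_call(build_pdf2txt_command(pdf_path, **options))
```

For instance, `build_pdf2txt_command("secret.pdf", output="output.txt", password="mypassword")` yields the same arguments as the third command in Example 1.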

 

Example 2:

$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)

$ dumppdf.py -T foo.pdf
(dump the table of contents)

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)

Options:

-a
    Instructs to dump all the objects. By default, it only prints the document trailer (like a header).

-i objno,objno, ...
    Specifies PDF object IDs to display. Comma-separated IDs, or multiple -i options are accepted.

-p pageno,pageno, ...
    Specifies the page number to be extracted. Comma-separated page numbers, or multiple -p options are accepted. Note that page numbers start at one, not zero.

-r (raw)
-b (binary)
-t (text)
    Specifies the output format of stream contents. Because the contents of stream objects can be very large, they are omitted when none of the options above is specified.

    With -r option, the "raw" stream contents are dumped without decompression. With -b option, the decompressed contents are dumped as a binary blob. With -t option, the decompressed contents are dumped in a text format, similar to repr() manner. When -r or -b option is given, no stream header is displayed for the ease of saving it to a file.

-T
    Shows the table of contents. 
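
The difference between -r, -b and -t is easier to see on a toy stream. The sketch below does not call dumppdf.py at all. It assumes a stream compressed with the common FlateDecode filter (zlib) and uses a made-up content stream to show roughly what each of the three views contains.

```python
import zlib

# A made-up page content stream, for illustration only
content = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET\n"

# What would be stored in the file when the stream uses FlateDecode (zlib)
raw = zlib.compress(content)

# -r: the raw bytes as stored, without decompression
raw_view = raw

# -b: the decompressed bytes as a binary blob
binary_view = zlib.decompress(raw)

# -t: the decompressed bytes in a repr()-like text form
text_view = repr(binary_view)

print(len(raw_view), len(binary_view))
print(text_view)
```

Only the -r and -b views are suitable for redirecting straight into a file, which is why the image extraction command in Example 2 uses -r.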

 

 

Writing your own PDF parsing script:

# -*- coding: utf-8 -*-
import io

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator

# io.open takes an encoding on both Python 2 and Python 3, so the
# extracted text can be written directly without a manual encode()
with open('PDF/1202268749.pdf', 'rb') as fp, \
        io.open('a.html', 'a', encoding='utf-8') as out:
    # Create a PDF parser for the input file
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)
    # Check whether the document allows text extraction
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Create a resource manager that stores shared resources
    rsrcmgr = PDFResourceManager()
    # Layout analysis parameters (defaults)
    laparams = LAParams()
    # Create a device that aggregates each page into an LTPage object
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process the pages one by one
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # Fetch the LTPage object for this page
        layout = device.get_result()
        for x in layout:
            # Keep only horizontal text boxes
            if isinstance(x, LTTextBoxHorizontal):
                out.write(x.get_text() + u'\n')
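
Each layout object also carries its position on the page. As a sketch of how that could be used, the hypothetical helper below sorts text boxes top to bottom using their `bbox` attribute `(x0, y0, x1, y1)`. PDF coordinates have their origin at the bottom-left corner, so a larger `y1` means higher on the page. `FakeBox` is a stand-in so the example runs without a PDF; in the script above the items would be the `LTTextBoxHorizontal` objects found in `layout`. This simple ordering is an assumption that suits single-column pages and is not a general reading-order algorithm.

```python
from collections import namedtuple


def sort_reading_order(boxes):
    """Sort text boxes top-to-bottom, then left-to-right (hypothetical helper).

    Each box needs a ``bbox`` attribute of the form (x0, y0, x1, y1),
    as pdfminer layout objects have.
    """
    # Negate y1 so that boxes higher on the page come first
    return sorted(boxes, key=lambda box: (-box.bbox[3], box.bbox[0]))


# Stand-in for LTTextBoxHorizontal, only for this demonstration
class FakeBox(namedtuple("FakeBox", ["bbox", "text"])):
    def get_text(self):
        return self.text


boxes = [
    FakeBox((72, 100, 300, 120), "footer\n"),
    FakeBox((72, 700, 300, 720), "title\n"),
    FakeBox((320, 400, 540, 420), "right column\n"),
    FakeBox((72, 400, 300, 420), "left column\n"),
]
ordered = [box.get_text() for box in sort_reading_order(boxes)]
print(ordered)
```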

 

References:

pdfminer official site:  http://www.unixuser.org/~euske/python/pdfminer/index.html

http://www.cnblogs.com/RoundGirl/p/4979267.html

 

Reposted from: https://www.cnblogs.com/rongyux/p/5445723.html
