pdfminer error report

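While extracting text from PDFs with the pdfminer.six library, I ran into two errors. On some PDF files an AttributeError is raised, stating that the 'PDFStream' object has no attribute 'replace'. On others, PDFTextExtractionNotAllowed is raised, indicating that text extraction is not allowed for that particular PDF.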


https://github.com/pdfminer/pdfminer.six/issues


pdf: https://links.sgx.com/1.0.0/corporate-announcements/HOBG2B5Y0EVJ9PYQ/Manhattan%20Resources%20Limited%20-%20Offer%20Information%20Statement%20dated%2027%20November%202018.pdf

python ${pdfminer_path}/pdf2txt.py -M 99 -L 1 -o "/pdf/L02/HOBG2B5Y0EVJ9PYQ.txt" "/L02/HOBG2B5Y0EVJ9PYQ.pdf"

Traceback (most recent call last):
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts/pdf2txt.py", line 132, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts/pdf2txt.py", line 127, in main
    outfp = extract_text(**vars(A))
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts/pdf2txt.py", line 62, in extract_text
    pdfminer3.high_level.extract_text_to_fp(fp, **locals())
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/high_level.py", line 79, in extract_text_to_fp
    interpreter.process_page(page)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdfinterp.py", line 851, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdfinterp.py", line 861, in render_contents
    self.init_resources(resources)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdfinterp.py", line 361, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdfinterp.py", line 211, in get_font
    font = self.get_font(None, subspec)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdfinterp.py", line 202, in get_font
    font = PDFCIDFont(self, spec)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/pdffont.py", line 656, in __init__
    self.cmap = CMapDB.get_cmap(name)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/cmapdb.py", line 257, in get_cmap
    data = klass._load_data(name)
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer3/cmapdb.py", line 231, in _load_data
    name = name.replace("\0", "")
AttributeError: 'PDFStream' object has no attribute 'replace'
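
The last frame of this traceback shows `_load_data` calling `name.replace` on a value that is a `PDFStream` rather than a string. Below is a minimal, stdlib-only sketch of that failure mode and of a defensive type check. `StreamStandIn`, `raises_attribute_error` and `cmap_name_or_none` are hypothetical names used for illustration, not pdfminer.six API, and the traceback does not say how the library itself ought to resolve a stream-valued CMap reference.

```python
class StreamStandIn:
    """Hypothetical stand-in for pdfminer's PDFStream: any object that is not a str."""

    def __init__(self, attrs):
        self.attrs = attrs


def raises_attribute_error(name):
    """Mimic the failing line `name = name.replace("\\0", "")` from the traceback."""
    try:
        name.replace("\0", "")
    except AttributeError:
        return True
    return False


def cmap_name_or_none(name):
    """Return the cleaned CMap name when `name` is a string, otherwise None.

    Assumption: the caller decides what to do with None (skip the font,
    fall back to another encoding, or report the file).
    """
    if isinstance(name, str):
        return name.replace("\0", "")
    return None


if __name__ == "__main__":
    stream = StreamStandIn({"CMapName": "example"})
    print(raises_attribute_error(stream))
    print(cmap_name_or_none("Identity-H\0"))
    print(cmap_name_or_none(stream))
```

A guard like this only turns the crash into an explicit "no usable name" result. Whether the affected font can still be decoded depends on how the stream is handled afterwards.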

pdf: http://www3.hkexnews.hk/listedco/listconews/SEHK/2019/0121/LTN20190121455.pdf

python ${pdfminer_path}/pdf2txt.py  -o "/pdf/00137/LTN20190121455.txt" "/pdf/00137/LTN20190121455.pdf"

Traceback (most recent call last):
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts//pdf2txt.py", line 136, in <module>
    if __name__ == '__main__': sys.exit(main())
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts//pdf2txt.py", line 131, in main
    outfp = extract_text(**vars(A))
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/EGG-INFO/scripts//pdf2txt.py", line 63, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer/high_level.py", line 80, in extract_text_to_fp
    check_extractable=True):
  File "/appvol/cnam/anaconda3/lib/python3.6/site-packages/pdfminer.six-20181108-py3.6.egg/pdfminer/pdfpage.py", line 132, in get_pages
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
pdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed: <_io.BufferedReader name='/appvol/selenium/hkex/pdf/00137/LTN20190121455.pdf'>
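
The traceback shows the exception being raised from `PDFPage.get_pages`, which is called with `check_extractable=True`. One possible workaround, sketched below, is to drive the extraction from Python and pass `check_extractable=False`. This assumes the package is importable as `pdfminer` and that you are permitted to extract the document's text. `extract_text_unchecked` is a hypothetical helper name, and the sketch has not been run against the PDF above.

```python
import io


def extract_text_unchecked(pdf_path):
    """Return the text of a PDF, skipping pdfminer's extractability check.

    Assumption: pdfminer.six is installed and importable as `pdfminer`.
    The imports are local so this module loads even without it.
    """
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    buffer = io.StringIO()
    device = TextConverter(rsrcmgr, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    try:
        with open(pdf_path, "rb") as fp:
            # check_extractable=False skips the check that raised
            # PDFTextExtractionNotAllowed in the traceback above.
            for page in PDFPage.get_pages(fp, check_extractable=False):
                interpreter.process_page(page)
    finally:
        device.close()
    return buffer.getvalue()


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py /path/to/file.pdf
    print(extract_text_unchecked(sys.argv[1]))
```

Disabling the check ignores the permission flag stored in the PDF, so it does nothing for files that fail for other reasons, such as the CMap error in the first traceback.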
