Converting PDF to Text in C#

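This article describes how to convert PDF files to text in C# using the PDFBox library, compares Adobe PDF IFilter, iTextSharp, and PDFBox, and provides a code example. It highlights PDFBox's advantages in conversion efficiency and dependency management.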


Original article:

http://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C

 

Warning 


June 20, 2012: This may not be the best way to parse PDF files (at least not the most efficient one). PDFBox is a great Java library but the IKVM.NET bridge makes it a little slow. 


Translator's note: PDFBox is a Java library for converting PDF to text. IKVM.NET wraps that library to expose it to C#, which is why the conversion is slower.

However, since a lot of people are still coming here for a PDF parsing solution (and it's been almost 7 years since this article was originally published), I have updated the article and the Visual Studio project so it works with the latest PDFBox version. It's also possible to download the project with all dependencies - something many people were struggling with. 


How to parse PDF files   


While extending the indexing solution for an intranet built using the Lucene.NET library, I decided to add support for PDF files. But DotLucene can only handle plain text, so the PDF files had to be converted.

 

After hours of Googling I found a reasonable solution that uses "pure" .NET - at least there are no dependencies other than a few assemblies of IKVM.NET. Before we start with the solution, let's take a look at the other ways I tried.

 

Using Adobe PDF IFilter 

Using Adobe PDF IFilter requires:


  1. Using unreliable COM interop to handle the IFilter interface (the combination of IFilter COM and Adobe PDF IFilter is especially troublesome), and


  2. A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.


Using iTextSharp 

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily focused on creating PDFs rather than reading them, but some classes do allow you to read a PDF - especially PdfReader. Extracting the text from the hierarchy of objects is not an easy task, though (PDF is not a simple format; the PDF Reference alone is a 7 MB PDF file, and that is compressed). I was able to get to PdfArray, PdfBoolean, PdfDictionary and other objects, but after some hours of trying to resolve PdfIndirectReference I gave up and threw away the iTextSharp-based parser.

Finally: PDFBox  

PDFBox is another Java PDF library. It is also ready to be used with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package). 

Using PDFBox in .NET requires adding references to:  

  • IKVM.OpenJDK.Core.dll 
  • IKVM.OpenJDK.SwingAWT.dll 
  • pdfbox-1.7.0.dll 

and copying the following files to the bin directory:

  • commons-logging.dll 
  • fontbox-1.7.0.dll 
  • IKVM.OpenJDK.Util.dll 
  • IKVM.Runtime.dll 
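
Since missing dependencies are a common stumbling block, a small startup check can report which of the files listed above are absent from the output directory. The following is only a sketch: the class name PdfBoxDependencyCheck and the method FindMissing are hypothetical names chosen for illustration, and the file names assume the PDFBox 1.7.0 package described here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class PdfBoxDependencyCheck
{
    // File names are taken from the two lists above (PDFBox 1.7.0 package);
    // adjust them if a different PDFBox version is used.
    static readonly string[] RequiredAssemblies =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.7.0.dll",
        "commons-logging.dll",
        "fontbox-1.7.0.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names of required assemblies that are not present in binDirectory.
    public static List<string> FindMissing(string binDirectory)
    {
        var missing = new List<string>();
        foreach (string name in RequiredAssemblies)
        {
            if (!File.Exists(Path.Combine(binDirectory, name)))
                missing.Add(name);
        }
        return missing;
    }
}
```

Calling PdfBoxDependencyCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory) at startup and logging the result would be one way to use it.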

Using PDFBox to parse PDFs is fairly easy:

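// Needs: using org.apache.pdfbox.pdmodel; using org.apache.pdfbox.util;
private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    try
    {
        return new PDFTextStripper().getText(doc);
    }
    finally { doc.close(); } // always release the loaded document
}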
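
For an indexing scenario, the helper above would typically be applied to many files. The sketch below shows one possible way to do that: it writes a .txt file next to every PDF in a folder. PdfBatchConverter and ConvertAll are hypothetical names, and the extractor is passed in as a delegate (an assumption made so that the sketch compiles without the PDFBox references).

```csharp
using System;
using System.IO;

static class PdfBatchConverter
{
    // Converts every *.pdf directly inside sourceDirectory to a .txt file next to it.
    // 'extractText' is any function that maps a PDF path to its text,
    // for example the parseUsingPDFBox helper. Returns the number of files converted.
    public static int ConvertAll(string sourceDirectory, Func<string, string> extractText)
    {
        int converted = 0;
        foreach (string pdfPath in Directory.GetFiles(sourceDirectory, "*.pdf"))
        {
            string textPath = Path.ChangeExtension(pdfPath, ".txt");
            File.WriteAllText(textPath, extractText(pdfPath));
            converted++;
        }
        return converted;
    }
}
```

Within the class that defines the helper, a call could look like PdfBatchConverter.ConvertAll(folder, parseUsingPDFBox), where folder is a placeholder for the directory to process.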


 

The size of the required assemblies adds up to almost 18 MB:

  • IKVM.OpenJDK.Core.dll (4 MB) 
  • IKVM.OpenJDK.SwingAWT.dll (6 MB) 
  • pdfbox-1.7.0.dll  (4 MB) 
  • commons-logging.dll (82 kB) 
  • fontbox-1.7.0.dll (180 kB) 
  • IKVM.OpenJDK.Util.dll (2 MB) 
  • IKVM.Runtime.dll (1 MB) 

The speed is not so bad: Parsing the U.S. Copyright Act PDF (1.4 MB) took about 13 seconds. 
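
To reproduce a timing like this on your own documents, the extraction call can be wrapped in a Stopwatch. This is a minimal sketch, not the measurement method used for the figure above: ExtractionTimer and Measure are hypothetical names, and the extractor is again passed as a delegate so the snippet stands on its own.

```csharp
using System;
using System.Diagnostics;

static class ExtractionTimer
{
    // Runs extractText on pdfPath and returns the elapsed wall-clock time.
    // The extracted text is handed back through the 'text' out parameter.
    public static TimeSpan Measure(Func<string, string> extractText, string pdfPath, out string text)
    {
        Stopwatch watch = Stopwatch.StartNew();
        text = extractText(pdfPath);
        watch.Stop();
        return watch.Elapsed;
    }
}
```

The first call in a process typically also includes assembly loading and JIT compilation, so timing a second run may give a noticeably different number.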

Related information 
