Original article:
http://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C
Warning
June 20, 2012: This may not be the best way to parse PDF files (at least not the most efficient one). PDFBox is a great Java library but the IKVM.NET bridge makes it a little slow.
Translator's note: PDFBox is a Java library for converting PDF to text. IKVM.NET makes that library callable from C#, which is why the conversion is slower.
However, since a lot of people are still coming here for a PDF parsing solution (and it's been almost 7 years since this article was originally published), I have updated the article and the Visual Studio project so it works with the latest PDFBox version. It's also possible to download the project with all dependencies - something many people were struggling with.
How to parse PDF files
While extending the indexing solution for an intranet built using the Lucene.NET library, I decided to add support for PDF files. But Lucene.NET (formerly DotLucene) can only handle plain text, so the PDF files had to be converted.
After hours of Googling I found a reasonable solution that uses "pure" .NET - at least there are no dependencies other than a few IKVM.NET assemblies. Before we start with the solution, let's take a look at the other ways I tried.
Using Adobe PDF IFilter
Using Adobe PDF IFilter requires:
- Using unreliable COM interop that handles the IFilter interface (and the combination of IFilter COM and Adobe PDF IFilter is especially troublesome) and
- A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.
Using iTextSharp
iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily focused on creating rather than reading PDFs, but some classes allow you to read a PDF, especially PdfReader. But extracting the text from the hierarchy of objects is not an easy task (PDF is not a simple format; the PDF Reference is a 7 MB PDF file, and that is compressed). I was able to get to PdfArray, PdfBoolean, PdfDictionary and other objects, but after some hours of trying to resolve PdfIndirectReference I gave up and threw away the iTextSharp-based parser.
Finally: PDFBox
PDFBox is another Java PDF library. It is also ready to be used with the original Java Lucene (see LucenePDFDocument).
Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package).
Using PDFBox in .NET requires adding references to:
- IKVM.OpenJDK.Core.dll
- IKVM.OpenJDK.SwingAWT.dll
- pdfbox-1.7.0.dll
and copying the following files to the bin directory:
- commons-logging.dll
- fontbox-1.7.0.dll
- IKVM.OpenJDK.Util.dll
- IKVM.Runtime.dll
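Since missing assemblies are a common stumbling block, a quick check of the bin directory before calling PDFBox can save some debugging. The following is only a sketch: the class and method names (DependencyCheck, FindMissing) are made up for illustration, and the file list simply repeats the assemblies named above for PDFBox 1.7.0, so other versions will need different names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DependencyCheck
{
    // Assemblies named in the article for PDFBox 1.7.0
    // (the referenced ones plus the ones copied to bin).
    public static readonly string[] Required =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.7.0.dll",
        "commons-logging.dll",
        "fontbox-1.7.0.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names of required files that are not in the given directory.
    public static List<string> FindMissing(string directory)
    {
        var missing = new List<string>();
        foreach (string name in Required)
        {
            if (!File.Exists(Path.Combine(directory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

Calling DependencyCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory) at startup and logging the result is one way to use it.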
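Using PDFBox to parse PDFs is fairly easy:
    using org.apache.pdfbox.pdmodel;   // PDDocument
    using org.apache.pdfbox.util;      // PDFTextStripper (PDFBox 1.x namespace)

    private static string parseUsingPDFBox(string filename)
    {
        PDDocument doc = PDDocument.load(filename);
        try
        {
            // Extract the text of all pages as one string.
            return new PDFTextStripper().getText(doc);
        }
        finally { doc.close(); }   // always release the file
    }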
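To show how the extraction call might fit into a complete program, here is a minimal console sketch. It assumes the PDFBox 1.7.0 and IKVM.NET assemblies listed earlier are referenced; the class name PdfToTextTool, the method name ExtractText, and the choice to write a .txt file next to the input are illustrative assumptions, not part of the original project.

```csharp
using System;
using System.IO;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

public static class PdfToTextTool
{
    // Extracts all text from a PDF, closing the document even on failure.
    public static string ExtractText(string pdfPath)
    {
        PDDocument doc = PDDocument.load(pdfPath);
        try
        {
            return new PDFTextStripper().getText(doc);
        }
        finally
        {
            doc.close();
        }
    }

    public static int Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("Usage: PdfToTextTool <file.pdf>");
            return 1;
        }

        // Write the text next to the source file, e.g. report.pdf -> report.txt.
        string textPath = Path.ChangeExtension(args[0], ".txt");
        File.WriteAllText(textPath, ExtractText(args[0]));
        Console.WriteLine("Wrote " + textPath);
        return 0;
    }
}
```

Keeping the extraction in its own method means the same call can later feed an indexer instead of a text file.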
The size of the required assemblies adds up to almost 18 MB:
- IKVM.OpenJDK.Core.dll (4 MB)
- IKVM.OpenJDK.SwingAWT.dll (6 MB)
- pdfbox-1.7.0.dll (4 MB)
- commons-logging.dll (82 kB)
- fontbox-1.7.0.dll (180 kB)
- IKVM.OpenJDK.Util.dll (2 MB)
- IKVM.Runtime.dll (1 MB)
The speed is not so bad: Parsing the U.S. Copyright Act PDF (1.4 MB) took about 13 seconds.
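If you want to repeat the measurement on your own files, a small timing helper is enough. This is a sketch rather than the method used for the figure above: ExtractionTimer and its Time method are hypothetical names, and the extractor is passed in as a delegate so the helper itself depends only on the .NET base class library.

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Runs the extractor once and reports the elapsed wall-clock time.
    public static string Time(Func<string, string> extractor, string pdfPath, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        string text = extractor(pdfPath);
        watch.Stop();
        elapsed = watch.Elapsed;
        return text;
    }
}
```

From inside the class that defines the parsing method, a call could look like ExtractionTimer.Time(parseUsingPDFBox, "copyright.pdf", out elapsed), where the file name is a placeholder. The first call in a process also pays for loading and JIT-compiling the assemblies, so timing a second run separately may be informative.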
Related information
- See this article (with future updates) at SquarePDF.NET.