Parsing PDF Files with PDFBox

battlehawk

License: CC 4.0 BY-SA

Category: framework--java-PDFBox. Tags: exception, string, image, file, class, java

Article link: https://blog.youkuaiyun.com/battlehawk/article/details/4229840

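This article describes how to parse PDF files with PDFBox, including extracting text content to a TXT file, extracting embedded images, and converting PDF pages to image formats such as JPEG. Concrete Java code examples are provided.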

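I recently needed to parse PDF files and pull out their text and images: for example, writing the text to a Word document or a TXT file and saving the images separately. After searching online, PDFBox looked like the better solution. It also handles Chinese text reasonably well, although some PDFs still cannot be parsed completely correctly, depending on the specific file. See the code below. The PDFBox version used is 0.7.3.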

1. Extracting text:

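    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.OutputStreamWriter;

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    /**
     * @author battlehawk
     * Version 1.0
     * Create Date 2009-5-28
     */
    public class Pdf2text {

        /** Returns the text of the PDF, or "" if parsing fails. */
        public static String getTxt(File f) throws Exception {
            String ts = "";
            PDDocument pdfdocument = null;
            try {
                pdfdocument = PDDocument.load(f);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                OutputStreamWriter writer = new OutputStreamWriter(out);
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.writeText(pdfdocument.getDocument(), writer);
                // Close the writer first so everything is flushed into the buffer.
                writer.close();
                byte[] contents = out.toByteArray();
                ts = new String(contents);
                System.out.println(f.getName() + " length is: " + contents.length + "\n");
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                // Always release the document, even when parsing fails.
                if (pdfdocument != null) {
                    pdfdocument.close();
                }
            }
            // Returning here (not inside finally) avoids hiding exceptions.
            return ts;
        }

        public static void main(String[] args) {
            File file = new File("c:/数据库image.pdf");
            try {
                System.out.println(Pdf2text.getTxt(file));
            } catch (Exception e) {
                // TODO auto-generated catch block
                e.printStackTrace();
            }
        }
    }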
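getTxt only returns the extracted text as a String. To actually store it in a TXT file, something like the following could be used. This is a minimal sketch using only the Java standard library; the class TextFileSaver and its methods are hypothetical names, not part of PDFBox. It writes with an explicit UTF-8 encoding, on the assumption that relying on the platform default encoding is a common source of garbled Chinese text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of PDFBox): stores extracted text as a UTF-8 TXT file.
class TextFileSaver {

    /** Writes the text to target as UTF-8, replacing any existing file. */
    static void save(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads a file back as UTF-8, useful for checking the round trip. */
    static String read(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // In real use the text would come from Pdf2text.getTxt(file).
        // The escapes spell three Chinese characters, to exercise non-ASCII text.
        String text = "sample text \u6570\u636e\u5e93";
        Path target = Files.createTempFile("pdf2text", ".txt");
        save(text, target);
        System.out.println("Saved to " + target);
        System.out.println("Round trip intact: " + text.equals(read(target)));
    }
}
```

With the text extraction class above, the call would be along the lines of TextFileSaver.save(Pdf2text.getTxt(file), target), where target is whatever output path is wanted.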

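2. Extracting images with the ExtractImages.main(String[] args) method. When the content contained Chinese, the following exception was thrown (the error itself indicates that the Bouncy Castle provider class could not be found on the classpath):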

Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
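A NoClassDefFoundError of this kind generally means the named class was not available at run time. A quick way to confirm that before running the extraction is to try loading the class. The sketch below uses only the standard library; ClasspathCheck and isOnClasspath are hypothetical names chosen for illustration.

```java
// Hypothetical diagnostic (not part of PDFBox): checks whether a class can be loaded.
class ClasspathCheck {

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace above.
        String provider = "org.bouncycastle.jce.provider.BouncyCastleProvider";
        if (isOnClasspath(provider)) {
            System.out.println("Bouncy Castle provider found");
        } else {
            System.out.println("Bouncy Castle provider missing: add its jar to the classpath");
        }
    }
}
```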

Usage notes for the main method, from the source code:

    private static void usage()
    {
        System.err.println( "Usage: java org.pdfbox.ExtractImages [OPTIONS] <PDF file>\n" +
            " -password <password>        Password to decrypt document\n" +
            " -prefix <image-prefix>      Image prefix(default to pdf name)\n" +
            " <PDF file>                   The PDF document to use\n"
            );
        System.exit( 1 );
    }

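    import org.pdfbox.ExtractImages;

    public class Pdf2Image {

        public static void main(String[] args) {
            String pdfUrl = "C:/PDF格式的特点.pdf";
            // The PDF file is the only argument; images are named after the PDF by default.
            String[] s = {pdfUrl};
            try {
                ExtractImages.main(s);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }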

3. Converting a PDF to images, for example JPG images:

    import org.pdfbox.PDFToImage;

    public class Pdf2Image {

        public static void main(String[] args) {
            String pdfUrl = "C:/PDF格式的特点.pdf";
            // Select the output format with -imageType; the PDF file comes last.
            String[] s = {"-imageType", "jpg", pdfUrl};
            try {
                PDFToImage.main(s);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Usage notes for the main method, from the source code:

    private static void usage()
    {
        System.err.println( "Usage: java org.pdfbox.PDFToImage [OPTIONS] <PDF file>\n" +
            " -password <password>          Password to decrypt document\n" +
            " -imageType <image type>        (" + getImageFormats() + ")\n" +
            " -outputPrefix <output prefix> Filename prefix for image files\n" +
            " -startPage <number>          The first page to start extraction(1 based)\n" +
            " -endPage <number>            The last page to extract(inclusive)\n" +
            " <PDF file>                   The PDF document to use\n"
            );
        System.exit( 1 );
    }
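The options in this usage text can be combined, for instance to convert only a page range with a custom file name prefix. Building the argument array by hand gets error-prone as options are added, so a small helper may be convenient. The following is a sketch under stated assumptions: PdfToImageArgs and build are hypothetical names, the flags are taken verbatim from the usage text above, and the PDF file is placed last as in that text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of PDFBox): assembles the argument array for PDFToImage.main.
class PdfToImageArgs {

    /**
     * Builds the arguments from the flags listed in the usage text.
     * A null imageType or outputPrefix, or a page number below 1, omits that flag.
     */
    static String[] build(String pdfFile, String imageType, String outputPrefix,
                          int startPage, int endPage) {
        List<String> args = new ArrayList<String>();
        if (imageType != null) {
            args.add("-imageType");
            args.add(imageType);
        }
        if (outputPrefix != null) {
            args.add("-outputPrefix");
            args.add(outputPrefix);
        }
        if (startPage >= 1) {
            args.add("-startPage");          // 1 based
            args.add(String.valueOf(startPage));
        }
        if (endPage >= 1) {
            args.add("-endPage");            // inclusive
            args.add(String.valueOf(endPage));
        }
        args.add(pdfFile);                   // the PDF file goes last
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // "C:/sample.pdf" is a placeholder path for illustration.
        String[] built = build("C:/sample.pdf", "jpg", "page", 1, 3);
        // prints [-imageType, jpg, -outputPrefix, page, -startPage, 1, -endPage, 3, C:/sample.pdf]
        System.out.println(Arrays.toString(built));
    }
}
```

The resulting array would then be passed to PDFToImage.main in place of the hand-written one.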