Extracting Word Document Content in Java: .doc Edition

南风o

Published 2025-03-05 15:55:16


Tags: java, word, programming languages

Article link: https://blog.youkuaiyun.com/qq_43388295/article/details/146045080

Extracting Word Document Content in Java: Text Extraction and Image Saving

1. Feature Overview

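This article presents DocExtractorUtil, a Java utility class for document processing. It provides the following core features:
Text extraction: extracts plain text from Word documents in the binary .doc format
Image extraction and saving:
    Extracts images from documents in RTF format
    Prints information about each image to the console
    Saves the image files to local disk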
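
Because the text path expects a binary .doc file while the image path expects RTF, it can help to check which of the two a stream actually contains before choosing a method. The following is a minimal sketch under stated assumptions: WordFormatSniffer and its Format enum are hypothetical names that do not appear in the original utility class. It relies only on the well-known file signatures (binary .doc files are OLE2 containers, RTF files start with the text "{\rtf"). Note that an OLE2 signature only says the file is an OLE2 container, so other legacy Office files would match as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of DocExtractorUtil.
class WordFormatSniffer {
    enum Format { OLE2_DOC, RTF, UNKNOWN }

    // Signature of an OLE2 compound file, the container used by binary .doc
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // RTF documents begin with the ASCII characters {, backslash, r, t, f
    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    // Peeks at the first 8 bytes, then rewinds so the caller can still parse the stream.
    static Format sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("Wrap the stream in a BufferedInputStream first");
        }
        in.mark(8);
        byte[] head = new byte[8];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream ended before 8 bytes were available
            }
            n += r;
        }
        in.reset();
        if (n == head.length && Arrays.equals(head, OLE2_MAGIC)) {
            return Format.OLE2_DOC;
        }
        if (n >= RTF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, RTF_MAGIC.length), RTF_MAGIC)) {
            return Format.RTF;
        }
        return Format.UNKNOWN;
    }
}
```

A caller could wrap the upload in a BufferedInputStream, call sniff, and then route the stream to the text or image method accordingly.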

2. Dependencies

<!-- Apache POI: Word (.doc) handling via HWPF -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.17</version>
</dependency>

<!-- Image handling: javax.imageio (ImageIO) is part of the JDK, so no separate Maven dependency should be needed -->

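<!-- Spire.Doc (used for RTF parsing); it is distributed through the vendor's own repository, which likely has to be declared in the POM -->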
<dependency>
    <groupId>e-iceblue</groupId>
    <artifactId>spire.doc</artifactId>
    <version>3.12.0</version>
</dependency>

3. Core Code Walkthrough

Text extraction method

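import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public static String extractText(InputStream inputStream) throws IOException {
    // HWPFDocument handles the binary .doc format; close it when done
    try (HWPFDocument doc = new HWPFDocument(inputStream)) {
        Range range = doc.getRange();
        // Number of paragraphs in the document (was undefined in the original)
        int numP = range.numParagraphs();
        // Walk every paragraph and append its text
        StringBuilder ret = new StringBuilder();
        for (int i = 0; i < numP; ++i) {
            Paragraph p = range.getParagraph(i);
            ret.append(p.text());
        }
        return ret.toString();
    }
}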
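
The text returned by HWPF paragraphs usually still contains Word's internal control characters: each paragraph ends with a carriage return, and table cells end with the character U+0007. The sketch below shows one possible way to normalize such output after extraction. DocTextCleaner is a hypothetical helper that is not part of the original utility class, and the mapping it applies (paragraph mark to newline, cell mark to tab, other control characters dropped) is an assumption about what "plain text" should look like, not a rule imposed by POI. It only drops the control characters themselves, so any field instruction text between field markers is left in place.

```java
// Hypothetical post-processing helper; not part of DocExtractorUtil.
class DocTextCleaner {

    // Normalizes raw HWPF paragraph text into more conventional plain text.
    static String clean(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\r') {
                out.append('\n');   // paragraph mark becomes a line break
            } else if (c == '\u0007') {
                out.append('\t');   // table cell/row mark becomes a tab
            } else if (c == '\n' || c == '\t' || c >= ' ') {
                out.append(c);      // keep printable characters and plain whitespace
            }
            // any other control character (for example field markers) is dropped
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input imitating two paragraphs as HWPF would return them
        System.out.print(clean("First paragraph\rSecond paragraph\r"));
    }
}
```

With this in place, the result of extractText could be passed through DocTextCleaner.clean before being displayed or stored.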

Image extraction and saving

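import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedList;
import java.util.Queue;
import javax.imageio.ImageIO;
import com.spire.doc.Document;
import com.spire.doc.documents.DocumentObjectType;
import com.spire.doc.fields.DocPicture;
import com.spire.doc.interfaces.ICompositeObject;
import com.spire.doc.interfaces.IDocumentObject;

public static void extractImage(InputStream inputStream) {
    try {
        Document document = new Document();
        document.loadRtf(inputStream);

        // Make sure the output directory exists before writing into it
        File outputDir = new File("output");
        if (!outputDir.isDirectory() && !outputDir.mkdirs()) {
            throw new IOException("Cannot create directory: " + outputDir);
        }

        // Breadth-first traversal of the document tree, starting at the root
        Queue<ICompositeObject> nodes = new LinkedList<>();
        nodes.add(document);

        int imageCount = 0;
        while (!nodes.isEmpty()) {
            ICompositeObject node = nodes.poll();
            for (int i = 0; i < node.getChildObjects().getCount(); i++) {
                IDocumentObject child = node.getChildObjects().get(i);
                if (child instanceof ICompositeObject) {
                    // Composite nodes are queued so their children get visited too
                    nodes.add((ICompositeObject) child);
                } else if (child.getDocumentObjectType() == DocumentObjectType.Picture) {
                    DocPicture picture = (DocPicture) child;

                    // Build the image file name
                    String imageType = picture.getPictureType().toString().toLowerCase();
                    String fileName = "image_" + (++imageCount) + "." + imageType;

                    // Save the image to a file
                    ImageIO.write(picture.getImage(), imageType,
                        new File(outputDir, fileName));

                    // Print the image details to the console
                    System.out.println("Image found:");
                    System.out.println("├─ File name: " + fileName);
                    System.out.println("├─ Size: " + picture.getWidth() + "x" + picture.getHeight());
                    System.out.println("└─ Format: " + imageType.toUpperCase());
                }
            }
        }
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}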
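
One detail worth guarding against when saving: ImageIO.write returns false, without throwing, when no installed writer supports the requested format name. Embedded pictures in Word documents are often in formats such as EMF or WMF, for which a standard JDK has no writer, so the file may silently never be created. The sketch below shows one possible fallback. ImageFormatFallback and its methods are hypothetical names that are not part of the original utility class, and falling back to PNG is an assumption made for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Locale;
import javax.imageio.ImageIO;

// Hypothetical helper for illustration; not part of DocExtractorUtil.
class ImageFormatFallback {

    // Returns the requested format in lower case if ImageIO has a writer for it, otherwise "png".
    static String resolveFormat(String requested) {
        String name = requested == null ? "" : requested.toLowerCase(Locale.ROOT);
        boolean writable = !name.isEmpty()
                && ImageIO.getImageWritersByFormatName(name).hasNext();
        return writable ? name : "png";
    }

    // Writes the image and reports the format actually used; fails loudly instead of silently.
    static String writeImage(BufferedImage image, String requested, OutputStream out)
            throws IOException {
        String format = resolveFormat(requested);
        if (!ImageIO.write(image, format, out)) {
            throw new IOException("No ImageIO writer accepted format: " + format);
        }
        return format;
    }

    public static void main(String[] args) throws IOException {
        BufferedImage sample = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // "emf" has no writer in a standard JDK, so the helper falls back to PNG
        System.out.println("Written as: " + writeImage(sample, "emf", buffer));
    }
}
```

In the extraction loop, the format returned by resolveFormat could then also be used as the file extension, so that the name on disk matches the bytes actually written.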