Converting a .doc File to a String in Java

像云~

Last modified 2024-11-11 22:54:21

Category: 记录 (Notes). Tags: java, programming languages

First published 2024-11-11 21:56:20

Article link: https://blog.youkuaiyun.com/xy779323365/article/details/143695343

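Background: the experiments for my short paper recently needed a dataset of legal cases from the forestry domain, but most of what is available online concerns criminal law. So I had to download forestry-law court judgments myself, then preprocess and manually annotate them.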

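After crawling and downloading a batch of forestry-law cases from China Judgments Online (裁判文书网), I ended up with a batch of .doc files. I then wrote a Java preprocessing program, which first has to read each .doc file and convert it to a string before any processing. Below is my utility class for reading local .doc files in Java: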

First, add the pom dependencies:

<!-- Apache POI Core -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>5.2.3</version>
</dependency>

<!-- Apache POI - HWPF for .doc files -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>5.2.3</version>
</dependency>

<!-- Apache POI - XWPF for .docx files -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.3</version>
</dependency>

DocUtil:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
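// The next three imports are used by readHtml and readFile, which are added further down;
// they require the jsoup and Tika dependencies introduced there.
import org.apache.tika.Tika;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;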

public class DocUtil {
    // readDoc declares IOException, so main must declare (or catch) it to compile.
    public static void main(String[] args) throws IOException {
        String filePath = "F:\\xxxx\\xxxx\\农清欢滥伐林木罪滥伐林木罪刑事一审刑事判决书.doc";
        String content = readDoc(filePath);
        System.out.println(content);
    }

    

    /**
     * Reads the content of a .doc file.
     *
     * @param path path to the file
     * @return the file content as a string
     * @throws IOException if an error occurs while reading the file
     */
    public static String readDoc(String path) throws IOException {
        File file = new File(path);
        StringBuilder content = new StringBuilder();

        try (FileInputStream fis = new FileInputStream(file);
             HWPFDocument document = new HWPFDocument(fis);
             WordExtractor extractor = new WordExtractor(document)) {

            String[] paragraphs = extractor.getParagraphText();
            for (String paragraph : paragraphs) {
                content.append(paragraph).append("\n");
            }
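        } catch (IllegalArgumentException e) {
            // POI throws this when the file is not a real OLE2 .doc document.
            System.err.println("Failed to read .doc file; the file format may be incorrect: " + e.getMessage());
            throw e;
        }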

        return content.toString();
    }

    /**
     * Reads the content of a .docx file.
     *
     * @param path path to the file
     * @return the file content as a string
     * @throws IOException if an error occurs while reading the file
     */
    public static String readDocx(String path) throws IOException {
        File file = new File(path);
        StringBuilder content = new StringBuilder();

        try (FileInputStream fis = new FileInputStream(file);
             XWPFDocument document = new XWPFDocument(fis);
             XWPFWordExtractor extractor = new XWPFWordExtractor(document)) {

            String docText = extractor.getText();
            content.append(docText);
        }

        return content.toString();
    }

  
}

I tested it on a single file first, and unexpectedly it threw an error:

java.lang.IllegalArgumentException: The document is really a HTML file  
	at org.apache.poi.hwpf.HWPFDocumentCore.verifyAndBuildPOIFS(HWPFDocumentCore.java:141)  
	at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:223)  
	at com.backend.springbootinit.model.tokenizer.DocUtil.readDoc(DocUtil.java:30)  
	at com.backend.springbootinit.model.tokenizer.DocUtil.main(DocUtil.java:20)

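It claims the file I am reading is HTML? The file carries a .doc extension, but its actual content turned out to be HTML. So I added a readHtml method to the utility class to see what the page content looks like. To avoid garbled characters, the encoding is set to GBK first: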

pom dependency:

    <!-- JSoup for HTML parsing -->  
    <dependency>  
        <groupId>org.jsoup</groupId>  
        <artifactId>jsoup</artifactId>  
        <version>1.15.4</version>  
    </dependency>

Utility class code:

    /**
     * Reads the content of an HTML file.
     *
     * @param path path to the file
     * @return the file content as plain text
     * @throws IOException if an error occurs while reading the file
     */
    public static String readHtml(String path) throws IOException {
        File file = new File(path);
        StringBuilder content = new StringBuilder();

        // Charset name: "GBK" or "UTF-8", depending on how the file was saved
        Document doc = Jsoup.parse(file, "GBK");
        String text = doc.body().text();
        content.append(text);

        return content.toString();
    }
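
The charset passed to Jsoup.parse above is hard-coded. As a rough sketch of how the choice between "GBK" and "UTF-8" could be made per file, the helper below tries a strict UTF-8 decode and falls back to GBK when that fails. The class name CharsetGuess and the method guessCharset are hypothetical, and treating GBK as the only fallback is an assumption that suits Chinese-language pages. Note that a short GBK byte sequence can occasionally also be valid UTF-8, so this is a heuristic, not a guarantee.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetGuess {

    /**
     * Returns "UTF-8" when the bytes decode cleanly as UTF-8, otherwise "GBK".
     * Assumption: the input is either UTF-8 or GBK; no other charsets are considered.
     */
    static String guessCharset(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so fall back to the assumed legacy encoding.
            return "GBK";
        }
    }

    public static void main(String[] args) {
        // A valid three-byte UTF-8 sequence.
        byte[] utf8Bytes = {(byte) 0xE5, (byte) 0x88, (byte) 0xA4};
        // A two-byte sequence that is malformed as UTF-8 (0xD0 is not a continuation byte).
        byte[] legacyBytes = {(byte) 0xC5, (byte) 0xD0};

        System.out.println(guessCharset(utf8Bytes));
        System.out.println(guessCharset(legacyBytes));
    }
}
```

The returned name could then be passed as the second argument of Jsoup.parse(file, charsetName) after reading the file's bytes with Files.readAllBytes.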

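Sure enough, the document content all came out, along with a pile of front-end markup. From here it is enough to split the string and drop the parts unrelated to the body text. Evidently the .doc extension on some files is just a disguise; I am noting this pitfall here.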
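
To illustrate why the extension alone cannot be trusted, here is a minimal, dependency-free sketch that looks at the first bytes of a file. It relies on two well-known signatures: binary .doc files are OLE2 compound documents starting with D0 CF 11 E0 A1 B1 1A E1, and .docx files are ZIP archives starting with "PK\3\4". The class name FileKindSniffer, the returned labels, and the 512-byte window are my own choices for illustration. The check is coarse: "ole2" also matches .xls and .ppt, and "zip" matches any ZIP archive. It needs Java 11 or later for readNBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

class FileKindSniffer {

    // OLE2 compound-document signature (binary .doc, but also .xls and .ppt).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local-file-header signature; .docx files are ZIP archives.
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    /** Classifies the leading bytes of a file as "ole2", "zip", "html" or "unknown". */
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2)) {
            return "ole2";
        }
        if (startsWith(head, ZIP)) {
            return "zip";
        }
        // ISO-8859-1 maps every byte to a char, so ASCII markup can be searched safely.
        String text = new String(head, StandardCharsets.ISO_8859_1).toLowerCase(Locale.ROOT);
        if (text.contains("<html") || text.contains("<!doctype html")) {
            return "html";
        }
        return "unknown";
    }

    /** Reads at most the first 512 bytes of the file and classifies them. */
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(512));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of a file to inspect as the first argument.
        if (args.length > 0) {
            System.out.println(sniff(Path.of(args[0])));
        }
    }
}
```

A file named like a .doc that comes back as "html" is exactly the case that triggered the IllegalArgumentException above.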

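While at it, I improved the .doc check: rather than only testing endsWith("doc"), the code should also detect the file's real type. Here Tika does the detection:
pom dependencies: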

    <!-- Apache Tika for file type detection -->  
    <dependency>  
        <groupId>org.apache.tika</groupId>  
        <artifactId>tika-core</artifactId>  
        <version>2.8.0</version>  
    </dependency>  
    <dependency>  
        <groupId>org.apache.tika</groupId>  
        <artifactId>tika-parsers-standard-package</artifactId>  
        <version>2.8.0</version>  
    </dependency>

Utility class code:

    /**
     * Reads a file's content based on its detected type; supports .doc, .docx and HTML.
     *
     * @param path path to the file
     * @return the file content as a string, or an empty string if reading fails
     */
    public static String readFile(String path) {
        File file = new File(path);
        String content = "";

        if (!file.exists() || !file.isFile()) {
            System.err.println("The given path does not exist or is not a file: " + path);
            return content;
        }

        // Detect the real MIME type from the content, using the file name only as a hint.
        Tika tika = new Tika();
        String mimeType = "";
        try (FileInputStream fis = new FileInputStream(file)) {
            mimeType = tika.detect(fis, file.getName());
        } catch (IOException e) {
            System.err.println("Error while detecting the file type: " + e.getMessage());
            return content;
        }

        System.out.println("File MIME type: " + mimeType);

        try {
            switch (mimeType) {
                case "application/msword":
                    content = readDoc(path);
                    break;
                case "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                    content = readDocx(path);
                    break;
                case "text/html":
                    content = readHtml(path);
                    break;
                default:
                    System.err.println("Unsupported file type: " + mimeType);
            }
        } catch (Exception e) {
            System.err.println("Error while reading the file: " + e.getMessage());
            e.printStackTrace();
        }

        return content;
    }
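
Since the goal is to read the whole batch of downloaded files one by one, here is a rough sketch of a directory loop built around a method like readFile. The class BatchReader and its readAll method are hypothetical and not part of the original utility class. The reader is passed in as a function so the sketch stays free of POI, jsoup and Tika. It assumes, as readFile does above, that an empty string means the file could not be read, and it needs Java 11 or later for Path.of.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class BatchReader {

    /** True for regular files whose name ends in .doc or .docx (case-insensitive). */
    static boolean isWordFile(Path p) {
        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
        return Files.isRegularFile(p) && (name.endsWith(".doc") || name.endsWith(".docx"));
    }

    /**
     * Applies the reader to every Word-named file directly inside dir (no recursion).
     * Returns file name -> text, in sorted order, skipping files that yield an empty string.
     */
    static Map<String, String> readAll(Path dir, Function<String, String> reader) throws IOException {
        List<Path> files;
        try (Stream<Path> stream = Files.list(dir)) {
            files = stream.filter(BatchReader::isWordFile).sorted().collect(Collectors.toList());
        }

        Map<String, String> result = new LinkedHashMap<>();
        for (Path p : files) {
            String text = reader.apply(p.toString());
            if (text != null && !text.isEmpty()) {
                result.put(p.getFileName().toString(), text);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        // Stand-in reader that only echoes the path; swap in DocUtil::readFile in the real program.
        readAll(dir, path -> "read " + path)
                .forEach((name, text) -> System.out.println(name + " -> " + text));
    }
}
```

In the actual preprocessing program the call would look like BatchReader.readAll(dir, DocUtil::readFile), so that disguised HTML files are routed to readHtml by the Tika check instead of failing in readDoc.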