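Background: my recent paper experiments need a dataset of legal cases from the forestry domain, but most of what is available online concerns criminal law. So I had to download forestry-law judgments myself, then preprocess and manually annotate them.
After crawling a batch of forestry-law cases from China Judgements Online (裁判文书网), I ended up with a set of .doc files. I then wrote a Java preprocessing program, which first has to read each .doc file into a string before processing it. Below is my utility class for reading local .doc files in Java:
First, add the pom dependencies: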
<!-- Apache POI Core -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>5.2.3</version>
</dependency>
<!-- Apache POI - HWPF for .doc files -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>5.2.3</version>
</dependency>
<!-- Apache POI - XWPF for .docx files -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.3</version>
</dependency>
DocUtil:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.tika.Tika;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DocUtil {

    // readDoc declares IOException, so main has to declare it as well
    public static void main(String[] args) throws IOException {
        String filePath = "F:\\xxxx\\xxxx\\农清欢滥伐林木罪滥伐林木罪刑事一审刑事判决书.doc";
        String content = readDoc(filePath);
        System.out.println(content);
    }

    /**
     * Reads the content of a .doc file.
     *
     * @param path the file path
     * @return the file content as a string
     * @throws IOException if an error occurs while reading the file
     * @throws IllegalArgumentException if the file is not a real .doc file
     */
    public static String readDoc(String path) throws IOException {
        File file = new File(path);
        StringBuilder content = new StringBuilder();
        try (FileInputStream fis = new FileInputStream(file);
             HWPFDocument document = new HWPFDocument(fis);
             WordExtractor extractor = new WordExtractor(document)) {
            String[] paragraphs = extractor.getParagraphText();
            for (String paragraph : paragraphs) {
                content.append(paragraph).append("\n");
            }
        } catch (IllegalArgumentException e) {
            System.err.println("Failed to read the .doc file; the file format may be wrong: " + e.getMessage());
            throw e;
        }
        return content.toString();
    }

    /**
     * Reads the content of a .docx file.
     *
     * @param path the file path
     * @return the file content as a string
     * @throws IOException if an error occurs while reading the file
     */
    public static String readDocx(String path) throws IOException {
        File file = new File(path);
        try (FileInputStream fis = new FileInputStream(file);
             XWPFDocument document = new XWPFDocument(fis);
             XWPFWordExtractor extractor = new XWPFWordExtractor(document)) {
            return extractor.getText();
        }
    }
}
I tested it on a single file first and, unexpectedly, got an error:
java.lang.IllegalArgumentException: The document is really a HTML file
    at org.apache.poi.hwpf.HWPFDocumentCore.verifyAndBuildPOIFS(HWPFDocumentCore.java:141)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:223)
    at com.backend.springbootinit.model.tokenizer.DocUtil.readDoc(DocUtil.java:30)
    at com.backend.springbootinit.model.tokenizer.DocUtil.main(DocUtil.java:20)
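It says the file I am reading is HTML? The file carries a .doc extension, but its content turns out to be HTML. So I added a readHtml method to the utility class to see what the page content looks like. To avoid garbled characters, the encoding is set to GBK:
pom dependency: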
<!-- JSoup for HTML parsing -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.4</version>
</dependency>
Utility class code:
/**
 * Reads the content of an HTML file.
 *
 * @param path the file path
 * @return the file content as plain text
 * @throws IOException if an error occurs while reading the file
 */
public static String readHtml(String path) throws IOException {
    File file = new File(path);
    // "GBK" or "UTF-8", depending on how the file was actually encoded
    Document doc = Jsoup.parse(file, "GBK");
    // text() returns the text of the body with the markup stripped
    return doc.body().text();
}
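The GBK value in readHtml is hard-coded. If some of the downloaded files were saved in a different encoding, one option is to read the charset that the HTML declares about itself. The following is only a minimal sketch using the standard library: the class name `HtmlCharsetSniffer` and the method `sniff` are hypothetical, and it assumes the declaration appears in a `<meta>` tag within the first 4 KB of the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original DocUtil.
public class HtmlCharsetSniffer {

    // Matches charset=... inside a <meta> tag, e.g. <meta charset="gbk"> or
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset, or fallback if none is declared or it is unsupported. */
    public static Charset sniff(Path path, Charset fallback) throws IOException {
        byte[] head = new byte[4096];
        int n;
        try (InputStream in = Files.newInputStream(path)) {
            n = in.readNBytes(head, 0, head.length);
        }
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives decoding
        String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(prefix);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback
            }
        }
        return fallback;
    }
}
```

Under these assumptions, the result could be passed on as `Jsoup.parse(file, HtmlCharsetSniffer.sniff(file.toPath(), Charset.forName("GBK")).name())`, which keeps GBK as the default when the file declares nothing.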
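Sure enough, readHtml returned the document content, along with a pile of front-end tags. From there it is enough to split the string and drop everything unrelated to the body text. It seems the .doc suffix on some file names is just a disguise, so I am recording this pitfall here.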
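To see why the extension alone is misleading, it helps to look at the first bytes of the file. A genuine legacy .doc is an OLE2 compound file that starts with the signature `D0 CF 11 E0 A1 B1 1A E1`, a .docx is a ZIP package that starts with `PK\x03\x04`, and an HTML file saved under a .doc name starts with plain markup. The following is a rough sketch of such a check; `MagicSniffer` and `guessKind` are hypothetical names, and the HTML test is a simple heuristic that only inspects the first 512 bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper, not part of the original DocUtil.
public class MagicSniffer {

    // OLE2 compound file signature, used by legacy .doc files
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local file header, used by .docx packages
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    /** Returns "ole2", "zip", "html" or "unknown" based on the leading bytes. */
    public static String guessKind(Path path) throws IOException {
        byte[] head = new byte[512];
        int n;
        try (InputStream in = Files.newInputStream(path)) {
            n = in.readNBytes(head, 0, head.length);
        }
        if (startsWith(head, n, OLE2)) {
            return "ole2";
        }
        if (startsWith(head, n, ZIP)) {
            return "zip";
        }
        // Heuristic: look for an HTML opening tag or doctype near the start
        String text = new String(head, 0, n, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        if (text.contains("<html") || text.contains("<!doctype html")) {
            return "html";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

This is only meant to illustrate the idea. A ZIP signature does not prove the file is a .docx, and an OLE2 signature is shared by other legacy Office formats, so a dedicated detection library is the more robust choice.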
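While I was at it, I improved the .doc detection: rather than only checking endsWith("doc"), the code should also check the file's real type. Tika is used for that here:
pom dependencies: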
<!-- Apache Tika for file type detection -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.8.0</version>
</dependency>
Utility class code:
/**
 * Reads a file's content according to its detected type; supports .doc, .docx and HTML.
 *
 * @param path the file path
 * @return the file content as a string, or an empty string if the file
 *         cannot be read or its type is not supported
 */
public static String readFile(String path) {
    File file = new File(path);
    String content = "";
    if (!file.exists() || !file.isFile()) {
        System.err.println("The given path does not exist or is not a file: " + path);
        return content;
    }
    // Detect the real MIME type from the content, with the file name as a hint
    Tika tika = new Tika();
    String mimeType = "";
    try (FileInputStream fis = new FileInputStream(file)) {
        mimeType = tika.detect(fis, file.getName());
    } catch (IOException e) {
        System.err.println("Error while detecting the file type: " + e.getMessage());
        return content;
    }
    System.out.println("File MIME type: " + mimeType);
    try {
        // Dispatch to the reader that matches the detected type
        switch (mimeType) {
            case "application/msword":
                content = readDoc(path);
                break;
            case "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                content = readDocx(path);
                break;
            case "text/html":
                content = readHtml(path);
                break;
            default:
                System.err.println("Unsupported file type: " + mimeType);
        }
    } catch (Exception e) {
        System.err.println("Error while reading the file: " + e.getMessage());
        e.printStackTrace();
    }
    return content;
}
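Since the goal is to run a whole batch of downloaded judgments through the reader one by one, a small driver that walks the download folder can call readFile for each file. The sketch below is an assumption about how that might look, not the code I actually ran: `BatchReader` and `readAll` are hypothetical names, the reader is passed in as a function so the example stays independent of POI, jsoup and Tika, and it assumes all files sit directly in one folder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical driver, not part of the original DocUtil.
public class BatchReader {

    /**
     * Applies reader to every .doc/.docx file directly inside dir, in sorted order.
     * Returns a map from file name to extracted text.
     */
    public static Map<String, String> readAll(Path dir, Function<String, String> reader)
            throws IOException {
        List<Path> files;
        // Files.list holds an open directory handle, so close it with try-with-resources
        try (Stream<Path> stream = Files.list(dir)) {
            files = stream.filter(Files::isRegularFile)
                    .filter(BatchReader::hasWordExtension)
                    .sorted()
                    .collect(Collectors.toList());
        }
        // LinkedHashMap keeps the results in the order the files were processed
        Map<String, String> result = new LinkedHashMap<>();
        for (Path file : files) {
            result.put(file.getFileName().toString(), reader.apply(file.toString()));
        }
        return result;
    }

    // The extension only selects candidates; the real type is decided by the reader
    private static boolean hasWordExtension(Path p) {
        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".doc") || name.endsWith(".docx");
    }
}
```

With the utility class above on the classpath, a call such as `BatchReader.readAll(Paths.get(folder), DocUtil::readFile)` would fit, because readFile takes a path string, returns a string, and throws no checked exceptions. Here `folder` stands for wherever the downloaded files live.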