Extracting Word file content with TextMining and Apache POI, without MS-Office ActiveX

Latest recommended article published on 2019-12-23 20:19:12


Article tags:

#Office #Apache #XP #Java

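This post shows Java code for extracting text from Word files. It uses tm-extractors-0.4.jar and its related libraries and defines a WordProcess class: the run method extracts the text of a Word 2000/XP file, and the main method writes the extracted text to result.txt.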
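Since run targets Word 2000/XP files, which are stored in the binary OLE2 compound file format, it can help to check the file header before attempting extraction. The following is a minimal sketch using only the JDK; the class name `Ole2Check` and its methods are hypothetical and not part of the original post. Note that the same signature also starts other OLE2 files such as .xls and .ppt, so a match only shows that the container format is plausible, not that the file is a Word document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of the original post.
class Ole2Check {
    // Signature that starts every OLE2 compound file (.doc, .xls, .ppt, ...).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True when the given bytes begin with the OLE2 signature.
    static boolean hasOle2Header(byte[] head) {
        return head.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads the first eight bytes of the file and checks them.
    static boolean looksLikeBinaryDoc(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[OLE2_MAGIC.length];
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                read += n;
            }
            return read == head.length && hasOle2Header(head);
        }
    }
}
```

A file that starts with `PK` instead is a ZIP container, which is how the later .docx format is packaged, so this check would reject it.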

/*
* Created on 2005/07/18
* Uses tm-extractors-0.4.jar
*/
package com.nova.colimas.common.doc;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.textmining.text.extraction.WordExtractor;

/**
* Deal with ms-word 2000/xp files.
* @author tyrone
*
*/
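public class WordProcess extends DocProcess {
    // Returns the text of the given Word file, or null if extraction fails.
    public static String run(String filename) {
        // try-with-resources closes the input stream even when extraction throws
        try (FileInputStream in = new FileInputStream(filename)) {
            WordExtractor extractor = new WordExtractor();
            return extractor.extractText(in);
        } catch (Exception ex) {
            // log
            return null;
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("Usage: WordProcess <word-file>");
            return;
        }
        String text = WordProcess.run(args[0]);
        if (text == null) {
            System.out.println("Could not extract text from " + args[0]);
            return;
        }
        // getBytes() with no argument uses the platform default charset
        try (FileOutputStream out = new FileOutputStream("result.txt")) {
            out.write(text.getBytes());
        } catch (Exception ex) {
            System.out.println(ex.toString());
        }
    }
}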
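
Because run returns null when extraction fails, and getBytes() without an argument encodes with the platform default charset, the step that writes result.txt can be made more defensive. The following is a minimal sketch using only the JDK; the class `ResultWriter` and its `write` method are hypothetical names, not part of the original code, and choosing UTF-8 is an assumption about the encoding you want for the output file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original post.
class ResultWriter {
    // Writes text to the target file as UTF-8.
    // Returns false, and writes nothing, when text is null.
    static boolean write(Path target, String text) throws IOException {
        if (text == null) {
            return false;
        }
        // Files.write creates the file or truncates an existing one.
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the text returned by WordProcess.run(args[0]).
        String extracted = args.length > 0 ? args[0] : null;
        if (!write(Paths.get("result.txt"), extracted)) {
            System.err.println("No text extracted; result.txt not written.");
        }
    }
}
```

Fixing the charset keeps non-ASCII text in the output, Chinese for example, readable regardless of the machine that ran the extraction.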