tm-extractors is a Word-reading tool that wraps POI. Download the jar, add it to your project, and it is ready to use. The code is as follows:
package com.you.read;

import java.io.FileInputStream;

import org.textmining.text.extraction.WordExtractor;

public class WordReader {

    // Reads the text of a .doc file. try-with-resources closes the stream
    // even if extraction throws.
    public static String readDoc(String doc) throws Exception {
        try (FileInputStream in = new FileInputStream(doc)) {
            WordExtractor extractor = new WordExtractor();
            return extractor.extractText(in);
        }
    }

    public static void main(String[] args) {
        try {
            String text = WordReader.readDoc("d:/bloom.doc");
            System.out.println(text);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Running it throws the following exception:
org.textmining.text.extraction.FastSavedException: Fast-saved files are unsupported at this time
at org.textmining.text.extraction.WordExtractor.extractText(WordExtractor.java:63)
at com.you.read.WordReader.readDoc(WordReader.java:14)
at com.you.read.WordReader.main(WordReader.java:20)
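Why? In my case, the cause was that the .doc file had been edited with WPS, which led to the exception. After I edited and saved the file with Microsoft Word 2003, the program printed: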
我在马路边捡到一分钱,把它交到警察叔叔手里面,叔叔拿着钱对我把头点,我高兴的说了声,叔叔再见。
The file was read successfully.
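
If you would rather see a clear hint than a raw stack trace when this happens, the failure can be recognized in the catch block. The following is only a minimal sketch: the class FastSaveCheck and its methods are hypothetical names, not part of tm-extractors, and it identifies the error by its class name and message text (an assumption made so the sketch compiles without the jar). With the jar on the classpath, catching org.textmining.text.extraction.FastSavedException directly would be the more natural choice.

```java
public class FastSaveCheck {

    // Returns true if the failure (or one of its causes) looks like the
    // fast-saved error from the stack trace above. Matching on the simple
    // class name and the message text is an assumption of this sketch.
    public static boolean isFastSavedFailure(Throwable t) {
        int depth = 0;
        // Walk the cause chain, with a small bound as a guard against cycles.
        for (Throwable cur = t; cur != null && depth < 16; cur = cur.getCause(), depth++) {
            String msg = cur.getMessage();
            if (cur.getClass().getSimpleName().equals("FastSavedException")
                    || (msg != null && msg.contains("Fast-saved"))) {
                return true;
            }
        }
        return false;
    }

    // Builds a short, human-readable explanation for a failed read.
    public static String describe(Throwable t) {
        if (isFastSavedFailure(t)) {
            return "Fast-saved .doc: re-save the file in Microsoft Word and try again.";
        }
        return "Could not read the document: " + t;
    }

    public static void main(String[] args) {
        // Simulated failure carrying the same message as the real exception.
        Exception simulated = new Exception("Fast-saved files are unsupported at this time");
        System.out.println(describe(simulated));
        // An unrelated failure falls through to the generic message.
        System.out.println(describe(new java.io.FileNotFoundException("d:/missing.doc")));
    }
}
```

In WordReader.main, the catch block could then print FastSaveCheck.describe(e) in place of, or in addition to, e.printStackTrace().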