Adding Chinese word segmentation to nutch-0.8
1. At crawl time, plug your own segmenter into NutchDocumentAnalyzer.java, in the following method:
/** Returns a new token stream for text from the named field. */
public TokenStream tokenStream(String fieldName, Reader reader) {
  // The stock anchor/content analyzer selection is bypassed so that
  // every field goes through the custom Chinese segmenter instead.
  // Analyzer analyzer;
  // if ("anchor".equals(fieldName))
  //   analyzer = ANCHOR_ANALYZER;
  // else
  //   analyzer = CONTENT_ANALYZER;
  // return analyzer.tokenStream(fieldName, reader);
  OO4UAnalyzer oo4uAnalyzer = new OO4UAnalyzer();
  return oo4uAnalyzer.tokenStream(fieldName, reader);
}
}
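OO4UAnalyzer is the author's own segmenter and its internals are not shown in this post. As a rough illustration of the kind of tokens such an analyzer emits, here is a minimal character-bigram splitter; the class and method names are hypothetical stand-ins, not part of Nutch or of OO4UAnalyzer:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of character-bigram segmentation, one common
 * fallback strategy for Chinese text: every pair of adjacent
 * characters becomes a token.
 */
public class BigramDemo {

    /** Splits a string into overlapping two-character tokens. */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        if (text.length() == 1) {
            tokens.add(text);            // a lone character is its own token
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中文分词"));   // [中文, 文分, 分词]
    }
}
```

A dictionary-based segmenter would produce fewer, more meaningful tokens, but bigrams show why a dedicated analyzer is needed at all: Lucene's default tokenizers treat a run of CJK characters very differently from word-delimited text.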
2. At search time, add segmentation in NutchAnalysis.java:
final public ArrayList phrase(String field) throws ParseException {
  ...
  ...
  if (this.queryFilters.isRawField(field)) {
    result.clear();
    StringReader input = new java.io.StringReader(queryString);
    try {
      org.apache.lucene.analysis.TokenStream tokenizer = new OO4UTokenizer(input);
      // just a demo: add every segmented token to the result
      for (org.apache.lucene.analysis.Token t = tokenizer.next(); t != null;
           t = tokenizer.next()) {
        result.add(t.termText());
      }
    } catch (IOException e) {
      // a StringReader will not actually throw here
    }
    // result.add(queryString.substring(start, end));
  }
  {if (true) return result;}
  throw new Error("Missing return statement in function");
}
/** Parse a compound term that is interpreted as an implicit phrase query.
 * Compounds are a sequence of terms separated by infix characters. Note that
 * this may return a single term, a trivial compound. */
final public ArrayList compound(String field) throws ParseException {
  ...
  ...
  if (this.queryFilters.isRawField(field)) {
    // result.clear();
    StringReader input = new java.io.StringReader(queryString);
    try {
      org.apache.lucene.analysis.TokenStream tokenizer = new OO4UTokenizer(input);
      // just a demo: collect each segmented token, as in the branch below
      for (org.apache.lucene.analysis.Token t = tokenizer.next(); t != null;
           t = tokenizer.next()) {
        result.add(t.termText());
      }
    } catch (IOException e) {
      // a StringReader will not actually throw here
    }
    // result.add(queryString.substring(start, token.endColumn));
  } else {
    StringReader input = new java.io.StringReader(queryString);
    try {
      org.apache.lucene.analysis.TokenStream tokenizer = new OO4UTokenizer(input);
      for (org.apache.lucene.analysis.Token t = tokenizer.next(); t != null;
           t = tokenizer.next()) {
        result.add(t.termText());
      }
    } catch (IOException e) {
      // ignored for the same reason as above
    }
  }
  {if (true) return result;}
  throw new Error("Missing return statement in function");
}
Q: My index-building program runs automatically, and I have hit a problem: if the search program runs while indexing is in progress, searching for any word throws java.io.IOException: Too many open files, and every later search fails too. Only restarting the service makes search work again. How can this be solved?
A: Your indexing mode may be wrong. An index can be built in two modes: from scratch or append. Building in append mode should avoid this problem, provided the indexing program is properly synchronized. IndexUtil.getIndexWriter(path, false) should do it.
Follow-up: I now rebuild the index in a separate directory and then copy it back to the original one; that way the problem is gone.
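The "rebuild elsewhere, then copy back" workaround above can be made more robust by renaming the finished directory into place instead of copying files over an index that searchers may have open. A minimal sketch using only the standard library; the IndexSwap class and the directory layout are hypothetical, not part of Nutch:

```java
import java.io.File;
import java.io.IOException;

/**
 * Sketch of the "build in a scratch directory, then swap" pattern:
 * the old index is moved aside, the freshly built one is renamed
 * into its place, and only then is the old index deleted.
 */
public class IndexSwap {

    /** Replaces liveIndex with freshIndex via directory renames. */
    static void swapIndex(File freshIndex, File liveIndex) throws IOException {
        File old = new File(liveIndex.getParent(), liveIndex.getName() + ".old");
        if (old.exists()) {
            deleteRecursively(old);          // leftovers from a failed earlier swap
        }
        if (liveIndex.exists() && !liveIndex.renameTo(old)) {
            throw new IOException("could not move old index aside");
        }
        if (!freshIndex.renameTo(liveIndex)) {
            throw new IOException("could not move new index into place");
        }
        deleteRecursively(old);              // drop the previous index last
    }

    /** Deletes a file or directory tree; silently ignores missing paths. */
    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }
}
```

Renames within one filesystem are cheap and near-atomic, so the window in which a searcher can see a half-written index is far smaller than with a file-by-file copy; after the swap, reopening the IndexSearcher picks up the new index.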
In summary, this post shows how to add Chinese word segmentation to Nutch-0.8 by modifying the two core files NutchDocumentAnalyzer.java and NutchAnalysis.java, and discusses the Too many open files error that can occur while the index is being rebuilt, along with ways to avoid it.