Source: http://banditjava.javaeye.com/blog/468303
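This article describes how to integrate the paoding Chinese word segmenter into nutch1.0 as a plugin. Before integrating it, first get nutch1.0 to compile in eclipse. Then write a Chinese analyzer class, set up the plugin configuration files, and rebuild and repackage.
On linux you can build directly; without a linux environment you also need to download and configure cygwin or a similar linux emulation layer.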
1. Environment
Tools: myeclipse6.5, jdk1.6.0_14, tomcat-6.0.20
Software: nutch1.0
Search for the relevant software yourself, then download and install it.
2. Configuring eclipse
After you create a new nutch project, compilation reports errors.
1) Download the missing jars
Download the MP3 and RTF jar files from http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/, and copy them into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively.
2) Fix the @Override errors
Eclipse flags @Override errors in these classes:
org.apache.nutch.indexer.solr.SolrDeleteDuplicates
org.apache.nutch.util.domain.DomainStatistics
Comment out the offending @Override annotations.
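3) Fix the licensing-issue errors
At this point the project usually still has two errors. In the official nutch 1.0 release these two problems were left unfixed because of licensing issues. What follows is the most important part.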
Edit RTFParseFactory.java under src\plugin\parse-rtf\src\java\org\apache\nutch\parse\rtf.
Add the import:
import org.apache.nutch.parse.ParseResult;
Change
public Parse getParse(Content content) {
to
public ParseResult getParse(Content content) {
Change
return new ParseStatus(ParseStatus.FAILED,
    ParseStatus.FAILED_EXCEPTION,
    e.toString()).getEmptyParse(conf);
to
return new ParseStatus(ParseStatus.FAILED,
    ParseStatus.FAILED_EXCEPTION,
    e.toString()).getEmptyParseResult(content.getUrl(), getConf());
Change
return new ParseImpl(text,
    new ParseData(ParseStatus.STATUS_SUCCESS,
        title,
        OutlinkExtractor.getOutlinks(text, this.conf),
        content.getMetadata(),
        metadata));
to
return ParseResult.createParseResult(content.getUrl(),
    new ParseImpl(text,
        new ParseData(ParseStatus.STATUS_SUCCESS,
            title,
            OutlinkExtractor.getOutlinks(text, this.conf),
            content.getMetadata(),
            metadata)));
Edit TestRTFParser.java under src\plugin\parse-rtf\src\test\org\apache\nutch\parse\rtf.
Change
parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
to
parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);
After this step the eclipse project should build without errors.
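The .get(urlString) call is needed because a ParseResult holds parses keyed by URL rather than a single Parse. The shape of that change can be modeled with simplified stand-in types. Note that the classes below are NOT the Nutch classes, only a toy model with assumed names:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for illustration only; these are not the Nutch classes.
final class ParseResultModel {

    // Stand-in for a single parsed document.
    static final class Parse {
        final String text;
        Parse(String text) { this.text = text; }
    }

    // Stand-in for a result that maps URL -> Parse.
    static final class ParseResult {
        private final Map<String, Parse> parses = new HashMap<String, Parse>();

        // Mirrors the idea behind ParseResult.createParseResult(url, parse).
        static ParseResult of(String url, Parse parse) {
            ParseResult result = new ParseResult();
            result.parses.put(url, parse);
            return result;
        }

        // Looks up the parse for one URL; null when the URL is unknown.
        Parse get(String url) { return parses.get(url); }
    }

    public static void main(String[] args) {
        String url = "http://example.com/test.rtf"; // hypothetical URL
        ParseResult result = ParseResult.of(url, new Parse("hello"));
        System.out.println(result.get(url).text);
    }
}
```

Callers that used to receive a Parse directly now have to ask the result for the entry belonging to their URL, which is exactly what the test change does.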
3. Configuring the paoding plugin
1) Write the Chinese analyzer class, extending NutchAnalyzer
package org.apache.nutch.analysis.zh;

// JDK imports
import java.io.Reader;

// Lucene imports
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

// Nutch imports
import org.apache.nutch.analysis.NutchAnalyzer;

public class ChineseAnalyzer extends NutchAnalyzer {

    // Single shared paoding analyzer that does the actual segmentation
    private final static Analyzer ANALYZER =
        new net.paoding.analysis.analyzer.PaodingAnalyzer();

    public ChineseAnalyzer() { }

    // Delegate tokenization to paoding
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return ANALYZER.tokenStream(fieldName, reader);
    }
}
2) Create the plugin directories analysis-zh and lib-paoding-analyzers under src/plugin.
Put the ChineseAnalyzer written above under analysis-zh/src,
then edit plugin.xml:
<plugin
    id="analysis-zh"
    name="Chinese Analysis Plug-in"
    version="1.0.0"
    provider-name="net.paoding.analysis">
  <runtime>
    <library name="analysis-zh.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
    <import plugin="lib-paoding-analyzers"/>
  </requires>
  <extension
      id="org.apache.nutch.analysis.zh"
      name="Chinese Analyzer"
      point="org.apache.nutch.analysis.NutchAnalyzer">
    <implementation id="ChineseAnalyzer"
        class="org.apache.nutch.analysis.zh.ChineseAnalyzer">
      <parameter name="lang" value="zh"/>
    </implementation>
  </extension>
</plugin>
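The class attribute in plugin.xml has to name the analyzer class exactly, and the lang parameter is what associates the analyzer with Chinese ("zh") content. A quick sanity check can be sketched with the JDK's DOM parser. This is a standalone helper with hypothetical names, not the way Nutch itself reads the manifest:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative helper, not part of Nutch.
final class PluginManifestCheck {

    // Returns lang -> implementation class for every <implementation>
    // element that carries a <parameter name="lang" .../> child.
    static Map<String, String> analyzersByLang(String pluginXml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(pluginXml)))
                .getDocumentElement();
            Map<String, String> result = new LinkedHashMap<String, String>();
            NodeList impls = root.getElementsByTagName("implementation");
            for (int i = 0; i < impls.getLength(); i++) {
                Element impl = (Element) impls.item(i);
                NodeList params = impl.getElementsByTagName("parameter");
                for (int j = 0; j < params.getLength(); j++) {
                    Element param = (Element) params.item(j);
                    if ("lang".equals(param.getAttribute("name"))) {
                        result.put(param.getAttribute("value"),
                                   impl.getAttribute("class"));
                    }
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("plugin.xml could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Trimmed-down copy of the manifest, inlined to keep the sketch self-contained.
        String xml =
            "<plugin id=\"analysis-zh\">"
            + "<extension point=\"org.apache.nutch.analysis.NutchAnalyzer\">"
            + "<implementation id=\"ChineseAnalyzer\""
            + " class=\"org.apache.nutch.analysis.zh.ChineseAnalyzer\">"
            + "<parameter name=\"lang\" value=\"zh\"/>"
            + "</implementation></extension></plugin>";
        System.out.println(analyzersByLang(xml));
    }
}
```

If the printed class name does not match the package and class of the analyzer source file, the plugin will most likely fail to load even though the build succeeds.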
Edit build.xml:
<project name="analysis-zh" default="jar-core">
  <import file="../build-plugin.xml"/>
  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-paoding-analyzers"/>
  </target>
  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-paoding-analyzers/*.jar"/>
    </fileset>
  </path>
</project>
lib-paoding-analyzers is configured in the same way, so it is not repeated here.
3) Edit build.xml in src\plugin
<target name="deploy">
  <ant dir="analysis-zh" target="deploy"/>
  <ant dir="lib-paoding-analyzers" target="deploy"/>
  ...
</target>
<target name="clean">
  <ant dir="analysis-zh" target="clean"/>
  <ant dir="lib-paoding-analyzers" target="clean"/>
  ...
</target>
4) Edit nutch-default.xml and add |analysis-(zh)| to plugin.includes,
so that the paoding jar and your own analysis-(zh) jar are loaded
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>
</description>
</property>
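As far as I know, Nutch treats the plugin.includes value as a regular expression that a plugin id must match in full, which is why analysis-(zh) has to be added as its own alternative. A minimal sketch of that matching, using plain java.util.regex instead of Nutch's own plugin loader (the class and method names are illustrative assumptions):

```java
import java.util.regex.Pattern;

// Illustrative helper, not part of Nutch: mimics testing a plugin id
// against the plugin.includes regular expression.
final class PluginIncludesCheck {

    // The plugin.includes value from nutch-default.xml shown above.
    static final String INCLUDES =
        "protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)"
        + "|index-basic|query-(basic|site|url)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)";

    // True when the whole id matches one of the alternatives.
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"analysis-zh", "analysis-fr", "lib-paoding-analyzers"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

lib-paoding-analyzers does not match the expression itself. The startup log in the last section still lists it as registered, presumably because analysis-zh imports it in its requires element.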
5) Edit the war target in the nutch project's build.xml
<lib dir="${build.dir}/analysis-zh">
  <include name="analysis-zh.jar"/>
</lib>
<lib dir="${build.dir}/lib-paoding-analyzers">
  <include name="paoding-analysis.jar"/>
</lib>
4. Rebuild
ant package
Note: nutch1.0 requires ant 1.7.1, mainly because the touch task needs ant 1.7.1 support.
5. Configure tomcat: edit webapps/cse/WEB-INF/classes/nutch-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>local</value>
</property>
<property><!-- local index directory -->
<name>searcher.dir</name>
<value>/nutch/local/crawled</value>
</property>
<property>
</property>
</configuration>
6. Configure the runtime environment
export PAODING_DIC_HOME=/nutch/dic
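Since paoding locates its dictionaries through PAODING_DIC_HOME, a missing or mistyped path is worth catching before starting tomcat. A minimal pre-flight check, assuming the variable is exported in the same environment that launches tomcat (the class and method names are illustrative):

```java
import java.io.File;

// Illustrative pre-flight check, not part of paoding or Nutch.
final class PaodingDicHomeCheck {

    // Returns the dictionary directory, or throws if the value is unusable.
    static File requireDicHome(String value) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("PAODING_DIC_HOME is not set");
        }
        File dir = new File(value.trim());
        if (!dir.isDirectory()) {
            throw new IllegalStateException(
                "PAODING_DIC_HOME is not a directory: " + dir);
        }
        return dir;
    }

    public static void main(String[] args) {
        File dir = requireDicHome(System.getenv("PAODING_DIC_HOME"));
        System.out.println("dictionary directory: " + dir.getAbsolutePath());
    }
}
```

Run it from the shell that will start tomcat; an export made in a different shell session is not visible to the tomcat process.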
7. Run and test
http://localhost:8080/
2009-09-14 10:26:49,312 INFO PluginRepository - Registered Plugins:
2009-09-14 10:26:49,312 INFO PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-09-14 10:26:49,312 INFO PluginRepository - Basic Query Filter (query-basic)
2009-09-14 10:26:49,312 INFO PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-09-14 10:26:49,312 INFO PluginRepository - Paoding Analysers (lib-paoding-analyzers)
2009-09-14 10:26:49,328 INFO PluginRepository - Html Parse Plug-in (parse-html)
2009-09-14 10:26:49,328 INFO PluginRepository - Basic Indexing Filter (index-basic)
2009-09-14 10:26:49,328 INFO PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-09-14 10:26:49,328 INFO PluginRepository - Site Query Filter (query-site)
2009-09-14 10:26:49,328 INFO PluginRepository - HTTP Framework (lib-http)
2009-09-14 10:26:49,328 INFO PluginRepository - Text Parse Plug-in (parse-text)
2009-09-14 10:26:49,328 INFO PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-09-14 10:26:49,328 INFO PluginRepository - Regex URL Filter (urlfilter-regex)
2009-09-14 10:26:49,328 INFO PluginRepository - Http Protocol Plug-in (protocol-http)
2009-09-14 10:26:49,328 INFO PluginRepository - XML Response Writer Plug-in (response-xml)
2009-09-14 10:26:49,328 INFO PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-09-14 10:26:49,328 INFO PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-09-14 10:26:49,343 INFO PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-09-14 10:26:49,343 INFO PluginRepository - Anchor Indexing Filter (index-anchor)
2009-09-14 10:26:49,343 INFO PluginRepository - JavaScript Parser (parse-js)
2009-09-14 10:26:49,343 INFO PluginRepository - URL Query Filter (query-url)
2009-09-14 10:26:49,343 INFO PluginRepository - Chinese Analysis Plug-in (analysis-zh)
2009-09-14 10:26:49,343 INFO PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2009-09-14 10:26:49,343 INFO PluginRepository - JSON Response Writer Plug-in (response-json)
2009-09-14 10:26:49,343 INFO PluginRepository - Registered Extension-Points:
2009-09-14 10:26:49,359 INFO PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
The Chinese Analysis Plug-in (analysis-zh) entry is the Chinese word segmentation plugin we just configured.
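To confirm this without reading the log by eye, the startup log can be scanned for the plugin id. A small sketch that works on log text passed in as a string (the helper name is an assumption, and where tomcat writes the log depends on your setup):

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper, not part of Nutch.
final class RegisteredPluginScan {

    // Plugin ids appear in parentheses at the end of a line, for example
    // "... Chinese Analysis Plug-in (analysis-zh)". Extension-point names
    // contain dots and upper-case letters, so they are not matched.
    private static final Pattern ID =
        Pattern.compile("\\(([a-z][a-z0-9-]*)\\)\\s*$", Pattern.MULTILINE);

    // Collects every plugin id found in the log text, in order of appearance.
    static Set<String> registeredIds(String logText) {
        Set<String> ids = new LinkedHashSet<String>();
        Matcher m = ID.matcher(logText);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String log =
            "2009-09-14 10:26:49,343 INFO PluginRepository - "
            + "Chinese Analysis Plug-in (analysis-zh)\n"
            + "2009-09-14 10:26:49,359 INFO PluginRepository - "
            + "Nutch Summarizer (org.apache.nutch.searcher.Summarizer)\n";
        Set<String> ids = registeredIds(log);
        System.out.println("analysis-zh registered: " + ids.contains("analysis-zh"));
    }
}
```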
That's it, all done. Give paoding a try; the segmentation quality is excellent.