The InjectMapper inside the Injector class contains this statement:
try {
scfilters.injectedScore(value, datum);
} catch (ScoringFilterException e) {
if (LOG.isWarnEnabled()) {
LOG.warn("Cannot filter injected score for url " + url
+ ", using default (" + e.getMessage() + ")");
}
}
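It calls the following method on an instance of the ScoringFilters class: public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException
Its purpose: using the score as the metric, compute (initialize) the score of every seed URL that has passed substitution (normalization) and filtering.
Next, the ScoringFilters class itself. It implements the ScoringFilter interface and holds a field private ScoringFilter[] filters;
Now look at the source of public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException: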
/** Calculate a new initial score, used when injecting new pages. */
public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException {
for (int i = 0; i < this.filters.length; i++) {
this.filters[i].injectedScore(url, datum);
}
}
So the real work of ScoringFilters.injectedScore is done by calling injectedScore() on each filter held in the private ScoringFilter[] filters field.
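The delegation described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical and simplified: Datum stands in for CrawlDatum, Filter for the ScoringFilter interface, and CompositeFilter for ScoringFilters. It is not Nutch code, only the same pattern in miniature.

```java
class CompositeScoringDemo {
    /** Hypothetical stand-in for CrawlDatum: holds only a score. */
    static class Datum {
        float score = 1.0f;
    }

    /** Hypothetical, simplified stand-in for the ScoringFilter interface. */
    interface Filter {
        void injectedScore(String url, Datum datum);
    }

    /** Implements Filter itself and delegates to each wrapped filter in array order. */
    static class CompositeFilter implements Filter {
        private final Filter[] filters;

        CompositeFilter(Filter... filters) {
            this.filters = filters;
        }

        @Override
        public void injectedScore(String url, Datum datum) {
            for (Filter f : filters) {
                f.injectedScore(url, datum);
            }
        }
    }

    /** Runs two toy filters in sequence and returns the resulting score. */
    static float run() {
        Filter doubler = (url, d) -> d.score *= 2.0f;
        Filter plusOne = (url, d) -> d.score += 1.0f;
        Datum datum = new Datum();
        new CompositeFilter(doubler, plusOne).injectedScore("http://example.com/", datum);
        return datum.score;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 3.0, i.e. (1.0 * 2) + 1
    }
}
```

Because each filter mutates the same datum, the array order matters: swapping the two toy filters would give (1.0 + 1) * 2 = 4.0 instead.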
It is therefore important to know how the contents of private ScoringFilter[] filters are determined.
Look at the constructor public ScoringFilters(Configuration conf).
Here is its source (note: the author added some test code for the experiment):
public ScoringFilters(Configuration conf) {
super(conf);
ObjectCache objectCache = ObjectCache.get(conf);
//Read the scoring.filter.order option from the configuration (conf/nutch-default.xml or conf/nutch-site.xml) to get the order
String order = conf.get("scoring.filter.order");
//Check the cache for an existing (already ordered) filters array; e.g. on a second instantiation it is reused
this.filters = (ScoringFilter[]) objectCache.getObject(ScoringFilter.class.getName());
if (this.filters == null) {//First time the ScoringFilters class is instantiated
//Split the scoring.filter.order value read above into individual class names
// orderedFilters is an array holding the class names in the order the filters should be applied
String[] orderedFilters = null;
if (order != null && !order.trim().equals("")) {
orderedFilters = order.split("\\s+");
}
try {
//test by kaiwii
System.out.println("try block");
ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(ScoringFilter.X_POINT_ID);
if (point == null) throw new RuntimeException(ScoringFilter.X_POINT_ID + " not found.");
Extension[] extensions = point.getExtensions();
HashMap<String, ScoringFilter> filterMap =
new HashMap<String, ScoringFilter>();
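for (int i = 0; i < extensions.length; i++) {//The extension point obtained via ScoringFilter.X_POINT_ID gives access to all ScoringFilter implementations
//The mapping between implementations and the extension point comes from scanning each plugin's plugin.xml (see the PluginRepository class for details)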
Extension extension = extensions[i];
//for test by kaiwii
System.out.println("extension_id "+i+extension.getId());
System.out.println("extension_clazz "+i+extension.getClazz());
//Plugins are lazily loaded, so the Extension obtained above only carries descriptive information about the plugin (its plugin descriptor)
//The filter is actually instantiated here
ScoringFilter filter = (ScoringFilter) extension.getExtensionInstance();
if (!filterMap.containsKey(filter.getClass().getName())) {//Collect all filter instances in a map keyed by class name
filterMap.put(filter.getClass().getName(), filter);
}
}
if (orderedFilters == null) {//The scoring.filter.order option in the configuration (conf/nutch-default.xml or conf/nutch-site.xml) is empty
objectCache.setObject(ScoringFilter.class.getName(), filterMap.values().toArray(new ScoringFilter[0]));
} else {
ScoringFilter[] filter = new ScoringFilter[orderedFilters.length];
for (int i = 0; i < orderedFilters.length; i++) {//Put the filters into a temporary array in the order given by orderedFilters
filter[i] = filterMap.get(orderedFilters[i]);
}
objectCache.setObject(ScoringFilter.class.getName(), filter);
}
} catch (PluginRuntimeException e) {
throw new RuntimeException(e);
}
//This is simply the array that was stored in the cache above
this.filters = (ScoringFilter[]) objectCache.getObject(ScoringFilter.class.getName());
//for test by kaiwii
System.out.println("filters content:kaiwii want to know:");
for(ScoringFilter v:this.filters){
System.out.println("filter:"+v.toString());
}
}
}
By now it should be reasonably clear which ScoringFilter implementations end up in the array.
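The ordering step of the constructor can be sketched on its own, without any Nutch classes. The class FilterOrderDemo, the method resolve, and the names demo.FilterA and demo.FilterB are all hypothetical. The sketch also deviates from the quoted constructor in two ways, both on purpose: it trims the order string before splitting, and it throws on an unknown name, whereas the quoted code would store a null entry via filterMap.get().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FilterOrderDemo {
    /**
     * Returns the filter names to apply, in order.
     * If 'order' is null or blank, every available name is returned in the
     * map's own iteration order. Otherwise 'order' is split on whitespace
     * and each name must exist in 'available'.
     */
    static List<String> resolve(Map<String, Object> available, String order) {
        if (order == null || order.trim().isEmpty()) {
            return new ArrayList<>(available.keySet());
        }
        List<String> result = new ArrayList<>();
        for (String name : order.trim().split("\\s+")) {
            if (!available.containsKey(name)) {
                // Assumption of this sketch: fail fast instead of keeping a null slot.
                throw new IllegalArgumentException("Unknown filter: " + name);
            }
            result.add(name);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical filter names mapped to placeholder instances.
        Map<String, Object> available = new LinkedHashMap<>();
        available.put("demo.FilterA", new Object());
        available.put("demo.FilterB", new Object());

        System.out.println(resolve(available, "demo.FilterB demo.FilterA"));
        System.out.println(resolve(available, ""));
    }
}
```

The first call follows the explicit order; the second falls back to whatever order the map yields, which is why the map type matters when the option is left empty.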
One more thing to look at is the scoring.filter.order option in the configuration (conf/nutch-default.xml or conf/nutch-site.xml):
<!-- scoring filters properties -->
<property>
<name>scoring.filter.order</name>
<value></value>
<description>The order in which scoring filters are applied.
This may be left empty (in which case all available scoring
filters will be applied in the order defined in plugin-includes
and plugin-excludes), or a space separated list of implementation
classes.
</description>
</property>
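From the description above you can tell that, when the option is empty, which ScoringFilter implementations get called is decided by plugin.includes and plugin.excludes. The description says these also define the order; note, however, that the constructor shown earlier collects the filters with filterMap.values() on a HashMap, whose iteration order is not guaranteed.
Take the default settings as an example. First, scoring.filter.order is empty;
next, plugin.includes is set as follows (the relevant entry is scoring-opic):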
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
Finally, plugin.excludes is empty.
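To see why only scoring-opic is selected among scoring plugins, here is a small sketch that treats the plugin.includes value as a regular expression matched against a whole plugin id. That matching rule is an assumption about how Nutch interprets the property, and PluginIncludesDemo and isIncluded are hypothetical names. The id scoring-link is used only as an example of an id that is not listed in the value.

```java
import java.util.regex.Pattern;

class PluginIncludesDemo {
    // The plugin.includes value quoted above, compiled as a regular expression.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)"
        + "|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)");

    /** Returns true if the whole plugin id matches the includes expression. */
    static boolean isIncluded(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIncluded("scoring-opic")); // prints true
        System.out.println(isIncluded("scoring-link")); // prints false
        System.out.println(isIncluded("parse-html"));   // prints true
    }
}
```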
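With the modified test code in place, running the Injector class produces the following console output (the relevant lines are those starting with extension_id, extension_clazz and filter:):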
Injector: starting at 2011-09-04 09:11:13
Injector: crawlDb: crawldb
Injector: urlDir: urls.txt
Injector: Converting injected urls to crawl db entries.
try block
extension_id 0org.apache.nutch.scoring.opic.OPICScoringFilter
extension_clazz 0org.apache.nutch.scoring.opic.OPICScoringFilter
filters content:kaiwii want to know:
filter:org.apache.nutch.scoring.opic.OPICScoringFilter@12a3793
using method:regexNormalize()
kaiwii want to know confiFilenull
use global rules
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-09-04 09:11:19, elapsed: 00:00:05
The console output above shows that only org.apache.nutch.scoring.opic.OPICScoringFilter was picked up.
So, to use a particular ScoringFilter implementation, follow what the property description says:
The order in which scoring filters are applied.
This may be left empty (in which case all available scoring
filters will be applied in the order defined in plugin-includes
and plugin-excludes), or a space separated list of implementation
classes.
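As an illustration only, an explicit setting might look like the fragment below, placed in conf/nutch-site.xml. The class name is the one seen in the console output; the corresponding plugin would still need to be matched by plugin.includes, and with several filters the class names would be separated by spaces.

```xml
<property>
  <name>scoring.filter.order</name>
  <value>org.apache.nutch.scoring.opic.OPICScoringFilter</value>
</property>
```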
Summary: get familiar with the plugin mechanism!