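In "Heritrix 3.3.0 Source Code Reading: Configuring URI Filtering Rules in crawler-beans.cxml", we saw the classes that Heritrix 3.3.0 configures to decide whether a URI is accepted. The goal of this post is to read the source code and understand:
(1) how a single URI rule class works;
(2) how a series of URI rule classes work together.
Let's start with the first question.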
(I)
Every URI rule class must extend the abstract class DecideRule:
package org.archive.modules.deciderules;
import java.io.Serializable;
import org.archive.modules.CrawlURI;
import org.archive.spring.HasKeyedProperties;
import org.archive.spring.KeyedProperties;
public abstract class DecideRule implements Serializable, HasKeyedProperties {
    // A thread-safe map used to hold key-value settings
    protected KeyedProperties kp = new KeyedProperties();
    public KeyedProperties getKeyedProperties() {
        return kp;
    }
    {
        setEnabled(true);
    }
    public boolean getEnabled() {
        return (Boolean) kp.get("enabled");
    }
    public void setEnabled(boolean enabled) {
        kp.put("enabled",enabled);
    }
    protected String comment = "";
    public String getComment() {
        return comment;
    }
    public void setComment(String comment) {
        this.comment = comment;
    }
    public DecideRule() {
    }
    /**
     * Makes a decision for a URI.
     * @param uri
     * @return
     */
    public DecideResult decisionFor(CrawlURI uri) {
        // If enabled is false, return DecideResult.NONE
        if (!getEnabled()) {
            return DecideResult.NONE;
        }
        // innerDecide is the method that actually makes the decision
        DecideResult result = innerDecide(uri);
        // This check looks redundant to me, since both branches return result;
        // if anyone knows what it is for, please let me know
        if (result == DecideResult.NONE) {
            return result;
        }
        return result;
    }
    /**
     * The method that actually makes the decision.
     * @param uri
     * @return
     */
    protected abstract DecideResult innerDecide(CrawlURI uri);
    /**
     * Useful when this rule can only ever produce a single decision.
     * @param uri
     * @return
     */
    public DecideResult onlyDecision(CrawlURI uri) {
        return null;
    }
    /**
     * Tells whether a URI is accepted.
     * @param uri
     * @return
     */
    public boolean accepts(CrawlURI uri) {
        // Compare the result of decisionFor with DecideResult.ACCEPT
        // to determine whether the URI is accepted
        return DecideResult.ACCEPT == decisionFor(uri);
    }
}
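The value of enabled determines whether a rule takes part in processing a URI: true means it does, false means it does not. The method that yields a rule's decision for a URI is decisionFor. When enabled is false it returns NONE immediately (the meaning of NONE is given just below); when enabled is true it calls innerDecide, which is implemented in subclasses. The onlyDecision method also deserves a mention: it is useful when a rule can only ever return one result. The base implementation returns null, and a rule with a single possible outcome overrides it to return that outcome.
Next, let's look at DecideResult, which keeps showing up in DecideRule: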
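The control flow described above can be sketched in a few lines. This is a minimal, self-contained sketch and not the Heritrix code: Decision, SketchRule, AlwaysAcceptRule and EnabledFlagDemo are hypothetical stand-ins, a plain String replaces CrawlURI, and a boolean field replaces the KeyedProperties entry.

```java
// Simplified stand-ins for illustration; these are not the Heritrix classes.
enum Decision { ACCEPT, NONE, REJECT }

abstract class SketchRule {
    // Heritrix stores this flag in KeyedProperties; a plain field keeps the sketch short.
    private boolean enabled = true;

    void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    // Public entry point: a disabled rule abstains without calling innerDecide.
    Decision decisionFor(String uri) {
        if (!enabled) {
            return Decision.NONE;
        }
        return innerDecide(uri);
    }

    // Subclasses supply the actual decision logic.
    protected abstract Decision innerDecide(String uri);

    // True only for an explicit ACCEPT; NONE and REJECT both count as "not accepted".
    boolean accepts(String uri) {
        return decisionFor(uri) == Decision.ACCEPT;
    }
}

// Hypothetical rule that accepts everything, used only to exercise the base class.
class AlwaysAcceptRule extends SketchRule {
    @Override
    protected Decision innerDecide(String uri) {
        return Decision.ACCEPT;
    }
}

class EnabledFlagDemo {
    public static void main(String[] args) {
        SketchRule rule = new AlwaysAcceptRule();
        System.out.println(rule.decisionFor("http://example.com/")); // ACCEPT
        rule.setEnabled(false);
        System.out.println(rule.decisionFor("http://example.com/")); // NONE
    }
}
```

Disabling the rule does not turn an ACCEPT into a REJECT; the rule merely abstains, so accepts() returns false for it.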
package org.archive.modules.deciderules;
/**
 * The decision of a DecideRule.
 *
 * @author pjack
 */
public enum DecideResult {
    /** Indicates the URI was accepted. */
    ACCEPT,
    /** Indicates the URI was neither accepted nor rejected. */
    NONE,
    /** Indicates the URI was rejected. */
    REJECT;
    /**
     * Inverts the result.
     * @param result
     * @return
     */
    public static DecideResult invert(DecideResult result) {
        switch (result) {
            case ACCEPT:
                return REJECT;
            case REJECT:
                return ACCEPT;
            default:
                return result;
        }
    }
}
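Its purpose is clear at a glance: ACCEPT, REJECT and NONE are the three possible decisions, and invert swaps ACCEPT and REJECT while leaving NONE unchanged.
Next, let's pick two concrete subclasses of DecideRule. First, RejectDecideRule, which is the first concrete rule in the configuration: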
package org.archive.modules.deciderules;
import org.archive.modules.CrawlURI;
/**
 * This class returns DecideResult.REJECT for every URI.
 *
 */
public class RejectDecideRule extends DecideRule {
    private static final long serialVersionUID = 3L;
    @Override
    protected DecideResult innerDecide(CrawlURI uri) {
        return DecideResult.REJECT;
    }
    @Override
    public DecideResult onlyDecision(CrawlURI uri) {
        return DecideResult.REJECT;
    }
}
This rule overrides DecideRule's innerDecide and onlyDecision methods. Its short code makes it plain that it returns REJECT for every URI.
Now let's look at TooManyHopsDecideRule:
/**
 * Rule REJECTs any CrawlURIs whose total number of hops (length of the
 * hopsPath string, traversed links of any type) is over a threshold.
 * Otherwise returns PASS.
 *
 * Blogger's note: "PASS" here means the URI is neither accepted nor
 * rejected, i.e. DecideResult.NONE.
 *
 * @author gojomo
 */
public class TooManyHopsDecideRule extends PredicatedDecideRule {
    private static final long serialVersionUID = 3L;
    /** default for this class is to REJECT */
    {
        setDecision(DecideResult.REJECT);
    }
    /**
     * Max path depth for which this filter will match.
     * Blogger's note: the initializer below sets the default maximum depth.
     */
    {
        setMaxHops(20);
    }
    public int getMaxHops() {
        return (Integer) kp.get("maxHops");
    }
    public void setMaxHops(int maxHops) {
        kp.put("maxHops", maxHops);
    }
    /**
     * Usual constructor.
     */
    public TooManyHopsDecideRule() {
    }
    /**
     * Evaluate whether given object is over the threshold number of
     * hops.
     *
     * @param object
     * @return true if the mx-hops is exceeded
     */
    @Override
    protected boolean evaluate(CrawlURI uri) {
        return uri.getHopCount() > getMaxHops();
    }
}
To discuss this class, we also have to look at the code of its direct parent:
/**
 * Rule which applies the configured decision only if a
 * test evaluates to true. Subclasses override evaluate()
 * to establish the test.
 *
 * @author gojomo
 */
public abstract class PredicatedDecideRule extends DecideRule {
    {
        setDecision(DecideResult.ACCEPT);
    }
    public DecideResult getDecision() {
        return (DecideResult) kp.get("decision");
    }
    public void setDecision(DecideResult decision) {
        kp.put("decision",decision);
    }
    public PredicatedDecideRule() {
    }
    @Override
    protected DecideResult innerDecide(CrawlURI uri) {
        if (evaluate(uri)) {
            return getDecision();
        }
        return DecideResult.NONE;
    }
    protected abstract boolean evaluate(CrawlURI object);
}
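PredicatedDecideRule overrides DecideRule's innerDecide. Its innerDecide uses evaluate as a test: if evaluate returns true, it returns the configured decision; otherwise it returns NONE. evaluate is implemented in TooManyHopsDecideRule, which also changes the configured decision from the parent's default of ACCEPT to REJECT. As a result, TooManyHopsDecideRule returns REJECT when a URI's hop count is greater than the configured maximum (20 by default) and NONE for every other URI; it never returns ACCEPT.
Reading these classes shows that the method a rule uses to produce its result is innerDecide. To write our own URI rule, we only need to extend DecideRule and override innerDecide (or, for a simple yes/no test, extend PredicatedDecideRule and override evaluate).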
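To make the outcomes of a predicate-based rule concrete, here is a small sketch of the same pattern. It is an illustration under stated assumptions, not the Heritrix implementation: Outcome, HopUri, PredicateSketchRule, HopLimitSketchRule and HopLimitDemo are hypothetical names, and HopUri carries only a URL and a hop count instead of being a full CrawlURI.

```java
// Simplified stand-ins for illustration; these are not the Heritrix classes.
enum Outcome { ACCEPT, NONE, REJECT }

// Hypothetical replacement for CrawlURI that only carries what the sketch needs.
class HopUri {
    final String url;
    final int hopCount;

    HopUri(String url, int hopCount) {
        this.url = url;
        this.hopCount = hopCount;
    }
}

abstract class PredicateSketchRule {
    // Same default as PredicatedDecideRule: ACCEPT unless a subclass changes it.
    private Outcome decision = Outcome.ACCEPT;

    void setDecision(Outcome decision) {
        this.decision = decision;
    }

    // Apply the configured decision only when the test matches; otherwise abstain.
    Outcome decide(HopUri uri) {
        return evaluate(uri) ? decision : Outcome.NONE;
    }

    protected abstract boolean evaluate(HopUri uri);
}

class HopLimitSketchRule extends PredicateSketchRule {
    private final int maxHops;

    HopLimitSketchRule(int maxHops) {
        this.maxHops = maxHops;
        setDecision(Outcome.REJECT); // this rule exists to reject, never to accept
    }

    @Override
    protected boolean evaluate(HopUri uri) {
        return uri.hopCount > maxHops; // strictly greater, as in the source above
    }
}

class HopLimitDemo {
    public static void main(String[] args) {
        HopLimitSketchRule rule = new HopLimitSketchRule(20);
        System.out.println(rule.decide(new HopUri("http://example.com/deep", 21)));    // REJECT
        System.out.println(rule.decide(new HopUri("http://example.com/shallow", 20))); // NONE
    }
}
```

A URI exactly at the limit is not rejected, and a URI under the limit is not accepted either; the rule leaves that to the other rules in the sequence.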
Next, let's see how the multiple rules in a sequence work together.
(II)
Let's look at the DecideRuleSequence class:
package org.archive.modules.deciderules;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.archive.modules.CrawlURI;
import org.archive.modules.SimpleFileLoggerProvider;
import org.archive.modules.net.CrawlHost;
import org.archive.modules.net.ServerCache;
import org.json.JSONObject;
import org.springframework.beans.factory.BeanNameAware;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.Lifecycle;
public class DecideRuleSequence extends DecideRule implements BeanNameAware, Lifecycle {
    final private static Logger LOGGER =
        Logger.getLogger(DecideRuleSequence.class.getName());
    private static final long serialVersionUID = 3L;
    protected transient Logger fileLogger = null;
    /**
     * If enabled, log decisions to file named logs/{spring-bean-id}.log. Format
     * is: [timestamp] [decisive-rule-num] [decisive-rule-class] [decision]
     * [uri] [extraInfo]
     *
     * Relies on Spring Lifecycle to initialize the log. Only top-level
     * beans get the Lifecycle treatment from Spring, so bean must be top-level
     * for logToFile to work. (This is true of other modules that support
     * logToFile, and anything else that uses Lifecycle, as well.)
     */
    {
        setLogToFile(false);
    }
    public boolean getLogToFile() {
        return (Boolean) kp.get("logToFile");
    }
    public void setLogToFile(boolean enabled) {
        kp.put("logToFile",enabled);
    }
    /**
     * Whether to include the "extra info" field for each entry in crawl.log.
     * "Extra info" is a json object with entries "host", "via", "source" and
     * "hopPath".
     */
    protected boolean logExtraInfo = false;
    public boolean getLogExtraInfo() {
        return logExtraInfo;
    }
    public void setLogExtraInfo(boolean logExtraInfo) {
        this.logExtraInfo = logExtraInfo;
    }
    // provided by CrawlerLoggerModule which is in heritrix-engine, inaccessible
    // from here, thus the need for the SimpleFileLoggerProvider interface
    protected SimpleFileLoggerProvider loggerModule;
    public SimpleFileLoggerProvider getLoggerModule() {
        return this.loggerModule;
    }
    @Autowired
    public void setLoggerModule(SimpleFileLoggerProvider loggerModule) {
        this.loggerModule = loggerModule;
    }
    @SuppressWarnings("unchecked")
    public List<DecideRule> getRules() {
        return (List<DecideRule>) kp.get("rules");
    }
    /**
     * The list of rules is injected here.
     * @param rules
     */
    public void setRules(List<DecideRule> rules) {
        kp.put("rules", rules);
    }
    protected ServerCache serverCache;
    public ServerCache getServerCache() {
        return this.serverCache;
    }
    @Autowired
    public void setServerCache(ServerCache serverCache) {
        this.serverCache = serverCache;
    }
    /**
     * The method that actually makes the decision.
     * As this method shows, a non-NONE decision from a rule later in the
     * chain overrides the decision from the rules before it.
     */
    public DecideResult innerDecide(CrawlURI uri) {
        // The rule that actually made the decision, and its index
        DecideRule decisiveRule = null;
        int decisiveRuleNumber = -1;
        // By default, neither reject nor accept
        DecideResult result = DecideResult.NONE;
        List<DecideRule> rules = getRules();
        int max = rules.size();
        for (int i = 0; i < max; i++) {
            DecideRule rule = rules.get(i);
            if (rule.onlyDecision(uri) != result) {
                DecideResult r = rule.decisionFor(uri);
                if (LOGGER.isLoggable(Level.FINEST)) {
                    LOGGER.finest("DecideRule #" + i + " " +
                        rule.getClass().getName() + " returned " + r + " for url: " + uri);
                }
                if (r != DecideResult.NONE) {
                    result = r;
                    decisiveRule = rule;
                    decisiveRuleNumber = i;
                }
            }
        }
        decisionMade(uri, decisiveRule, decisiveRuleNumber, result);
        return result;
    }
    /**
     * Called after a decision has been made for a CrawlURI.
     * @param uri
     * @param decisiveRule
     * @param decisiveRuleNumber
     * @param result
     */
    protected void decisionMade(CrawlURI uri, DecideRule decisiveRule,
            int decisiveRuleNumber, DecideResult result) {
        if (fileLogger != null) {
            JSONObject extraInfo = null;
            if (logExtraInfo) {
                CrawlHost crawlHost = getServerCache().getHostFor(uri.getUURI());
                String host = "-";
                if (crawlHost != null) {
                    host = crawlHost.fixUpName();
                }
                extraInfo = new JSONObject();
                extraInfo.put("hopPath", uri.getPathFromSeed());
                extraInfo.put("via", uri.getVia());
                extraInfo.put("seed", uri.getSourceTag());
                extraInfo.put("host", host);
            }
            fileLogger.info(decisiveRuleNumber
                + " " + decisiveRule.getClass().getSimpleName()
                + " " + result
                + " " + uri
                + (extraInfo != null ? " " + extraInfo : ""));
        }
    }
    protected String beanName;
    public String getBeanName() {
        return this.beanName;
    }
    @Override
    public void setBeanName(String name) {
        this.beanName = name;
    }
    protected boolean isRunning = false;
    @Override
    public boolean isRunning() {
        return isRunning;
    }
    @Override
    public void start() {
        // Set up the file logger
        if (getLogToFile() && fileLogger == null) {
            fileLogger = loggerModule.setupSimpleLog(getBeanName());
        }
        isRunning = true;
    }
    @Override
    public void stop() {
        isRunning = false;
    }
}
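This class is also a subclass of DecideRule. It overrides innerDecide, and the implementation shows that whenever a later rule returns a result other than NONE, the new result overwrites the old one. The onlyDecision check at the top of the loop skips a rule whose only possible decision equals the current result, since such a rule could not change the outcome. Now the following comment in the configuration file finally makes sense: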
<!-- SCOPE: rules for which discovered URIs to crawl; order is very
important because last decision returned other than 'NONE' wins. -->
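So when we customize our own URI filtering rules, we need not only to control the behavior of innerDecide but also to arrange the order of the rules.
(Since each class is short and simple, I have pasted all of the code here.)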
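The "last non-NONE decision wins" behavior is easy to check with a stripped-down sketch. This is a simplified illustration, not the Heritrix code: Result and SequenceSketch are hypothetical names, rules are modelled as plain functions from a URI string to a Result, and the onlyDecision shortcut and the logging are left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Simplified stand-in for DecideResult.
enum Result { ACCEPT, NONE, REJECT }

class SequenceSketch {
    // Hypothetical rules, modelled as functions from a URI string to a Result.
    static final Function<String, Result> REJECT_ALL = uri -> Result.REJECT;
    static final Function<String, Result> ACCEPT_EXAMPLE_HOST =
            uri -> uri.contains("example.com") ? Result.ACCEPT : Result.NONE;

    // Walk the rules in order; each non-NONE answer overwrites the previous one.
    static Result decide(List<Function<String, Result>> rules, String uri) {
        Result result = Result.NONE;
        for (Function<String, Result> rule : rules) {
            Result r = rule.apply(uri);
            if (r != Result.NONE) {
                result = r;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String uri = "http://example.com/index.html";
        // Reject everything first, then let the host rule override: prints ACCEPT.
        System.out.println(decide(Arrays.asList(REJECT_ALL, ACCEPT_EXAMPLE_HOST), uri));
        // Same rules in reverse order: the blanket REJECT has the last word, prints REJECT.
        System.out.println(decide(Arrays.asList(ACCEPT_EXAMPLE_HOST, REJECT_ALL), uri));
    }
}
```

The same two rules give opposite answers depending on their order, which is why a blanket rule such as RejectDecideRule sits at the start of the configured sequence and more specific rules follow it.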