Heritrix 3 differs substantially from Heritrix 1.14. Heritrix 3 defines blocking FIFO queues in its frontier, which makes it a classic producer-consumer design.
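AbstractFrontier defines two containers, inbound and outbound.
The inbound container holds the CrawlURIs that are about to be handled. Links that Heritrix has discovered while crawling, and links that are waiting to be processed, are all placed in inbound first.
The outbound container holds the CrawlURIs that are being handed out right now. The Frontier, which acts as the link factory, takes links from outbound and processes them.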
/** inbound updates: URIs to be scheduled, finished; requested state changes */
transient protected ArrayBlockingQueue<InEvent> inbound;
/** outbound URIs */
transient protected ArrayBlockingQueue<CrawlURI> outbound;
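To make the producer-consumer relationship concrete, here is a minimal sketch of a hand-off through a bounded blocking queue. It is not Heritrix code: the class `HandoffSketch`, the method `handOff`, and the use of plain `String` values in place of `CrawlURI` are assumptions made for illustration only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class HandoffSketch {

    /**
     * Hands one URI from a producer (the calling thread) to a consumer
     * (a worker thread) through a bounded blocking queue and returns
     * what the consumer received. Hypothetical helper, not a Heritrix API.
     */
    static String handOff(String uri) {
        // Capacity 1 and fair ordering; Heritrix uses CrawlURI elements instead of String.
        BlockingQueue<String> outbound = new ArrayBlockingQueue<>(1, true);
        String[] seen = new String[1];

        Thread worker = new Thread(() -> {
            try {
                // take() blocks until the producer has put something in the queue
                seen[0] = outbound.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        try {
            outbound.put(uri);   // put() would block if the queue were full
            worker.join();       // wait for the consumer before reading its result
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(handOff("http://example.com/"));
    }
}
```

The consumer never polls in a loop: `take()` parks the thread until an element arrives, which is the behavior that makes a blocking queue a natural fit for this model.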
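AbstractFrontier instantiates inbound and outbound when it starts.
The capacity of inbound is the capacity of outbound multiplied by getInboundQueueMultiple(), which works out to ten times the capacity of outbound here.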
public void start() {
    if (isRunning()) {
        return;
    }
    if (getRecoveryLogEnabled()) try {
        initJournal(loggerModule.getPath().getFile().getAbsolutePath());
    } catch (IOException e) {
        throw new IllegalStateException(e);
    }
    this.outboundCapacity = getOutboundQueueCapacity();
    this.inboundCapacity = outboundCapacity *
        getInboundQueueMultiple();
    outbound = new ArrayBlockingQueue<CrawlURI>(outboundCapacity, true);
    inbound = new ArrayBlockingQueue<InEvent>(inboundCapacity, true);
    pause();
    startManagerThread();
}
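The sizing logic in start() can be tried in isolation. The sketch below is an illustration under assumptions, not Heritrix code: `QueueSizingSketch`, `OUTBOUND_CAPACITY`, and `INBOUND_MULTIPLE` are hypothetical names, and the values 5 and 10 are placeholders for whatever getOutboundQueueCapacity() and getInboundQueueMultiple() return in a real crawl configuration.

```java
import java.util.concurrent.ArrayBlockingQueue;

public class QueueSizingSketch {
    // Hypothetical stand-ins for getOutboundQueueCapacity() and
    // getInboundQueueMultiple(); real values come from the crawl configuration.
    static final int OUTBOUND_CAPACITY = 5;
    static final int INBOUND_MULTIPLE = 10;

    static ArrayBlockingQueue<String> newOutbound() {
        // The second argument requests fair ordering: threads blocked on the
        // queue are served in FIFO order.
        return new ArrayBlockingQueue<>(OUTBOUND_CAPACITY, true);
    }

    static ArrayBlockingQueue<String> newInbound() {
        // inbound is sized as a multiple of outbound, as in start()
        return new ArrayBlockingQueue<>(OUTBOUND_CAPACITY * INBOUND_MULTIPLE, true);
    }

    public static void main(String[] args) {
        ArrayBlockingQueue<String> outbound = newOutbound();
        ArrayBlockingQueue<String> inbound = newInbound();
        System.out.println("outbound capacity: " + outbound.remainingCapacity());
        System.out.println("inbound capacity: " + inbound.remainingCapacity());

        // Fill outbound; a further offer() returns false instead of blocking.
        for (int i = 0; i < OUTBOUND_CAPACITY; i++) {
            outbound.offer("uri-" + i);
        }
        System.out.println("offer to full outbound accepted: " + outbound.offer("extra"));
    }
}
```

Because both queues are bounded, a producer that calls put() on a full queue waits until space frees up, so the capacities act as back-pressure between the stages.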
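What inbound actually stores is a series of events (InEvent) waiting to be processed.
These events are taken from inbound and processed as follows: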
/**
 * Drain the inbound queue of update events, or at the very least
 * wait until some additional delayed-queue URI becomes available.
 *
 * @throws InterruptedException
 */
protected void drainInbound() throws InterruptedException {
    int batch = inbound.size();
    for (int i = 0; i < batch; i++) {
        inbound.take().process();
    }
    if (batch == 0) {
        // always do at least one timed try
        InEvent toProcess = inbound.poll(getMaxInWait(),
                TimeUnit.MILLISECONDS);
        if (toProcess != null) {
            toProcess.process();
        }
    }
}
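The two-phase pattern in drainInbound() (process everything currently queued, otherwise make one timed attempt) can be reproduced in a self-contained sketch. Everything here is an assumption for illustration: `DrainSketch`, the `Event` interface standing in for InEvent, the `processed` list, the queue capacity of 16, and the 50 ms wait replacing getMaxInWait(). Unlike the original, `drain` also returns the number of events it ran so the behavior is observable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DrainSketch {
    /** Hypothetical stand-in for InEvent: a unit of work queued for the manager thread. */
    interface Event {
        void process();
    }

    final ArrayBlockingQueue<Event> inbound = new ArrayBlockingQueue<>(16, true);
    final List<String> processed = new ArrayList<>();

    /** Same two-phase logic as drainInbound(); returns how many events were processed. */
    int drain(long maxWaitMillis) throws InterruptedException {
        // Phase 1: snapshot the size and process exactly that many events,
        // so events arriving during the loop wait for the next call.
        int batch = inbound.size();
        for (int i = 0; i < batch; i++) {
            inbound.take().process();
        }
        // Phase 2: if nothing was queued, wait a bounded time for one event.
        if (batch == 0) {
            Event toProcess = inbound.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
            if (toProcess != null) {
                toProcess.process();
                return 1;
            }
        }
        return batch;
    }

    /** Queues two events, drains twice, and summarizes what happened. */
    static String demo() {
        DrainSketch s = new DrainSketch();
        try {
            s.inbound.put(() -> s.processed.add("schedule"));
            s.inbound.put(() -> s.processed.add("finish"));
            int first = s.drain(50);   // both queued events run
            int second = s.drain(50);  // queue is empty, so the timed poll gives up
            return first + " " + s.processed + " " + second;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "2 [schedule, finish] 0"
    }
}
```

Taking the size once up front bounds the work done per call, and the timed poll keeps the calling thread from spinning when the queue is empty while still letting it wake up periodically.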
To be continued.