Studying the Nutch Search Engine Source Code, Part 1: Web Page Fetching (1)

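This article walks through the crawler source code of the Nutch search engine, focusing on the workflow of the Fetcher class: its multi-threaded fetching mechanism, its handling of fetch statuses, and its redirect logic.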

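Studying the Nutch search engine source code, part 1: web page fetching.
Nutch's crawler code lives mainly in the package org.apache.nutch.fetcher, in the protocol plugins protocol-file, protocol-ftp, protocol-http and protocol-httpclient, and in the corresponding Parser plugins.
We start with org.apache.nutch.fetcher.
The central class is Fetcher, so we begin there and trace the code step by step.
We start from its run method.
First: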
[code]
for (int i = 0; i < threadCount; i++) { // spawn threads
  FetcherThread thread = new FetcherThread(THREAD_GROUP_NAME+i);
  thread.start();
}
[/code]
This creates several FetcherThread threads to fetch pages; threadCount can be configured, otherwise a default value is used.
Next comes the body of a while(true) loop:
[code]
int n = group.activeCount();
Thread[] list = new Thread[n];
group.enumerate(list);
boolean noMoreFetcherThread = true; // assumption
for (int i = 0; i < n; i++) {
  // this thread may have gone away in the meantime
  if (list[i] == null) continue;
  String tname = list[i].getName();
  if (tname.startsWith(THREAD_GROUP_NAME)) // prove it
    noMoreFetcherThread = false;
  if (LOG.isLoggable(Level.FINE))
    LOG.fine(list[i].toString());
}
if (noMoreFetcherThread) {
  if (LOG.isLoggable(Level.FINE))
    LOG.fine("number of active threads: "+n);
  if (pages == pages0 && errors == errors0 && bytes == bytes0)
    break;
  status();
  pages0 = pages; errors0 = errors; bytes0 = bytes;
}
[/code]
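In effect this loop supervises a hand-made thread pool: it checks whether any fetcher threads are still alive and writes the fetch rate, status and similar information to the log. The same job could be done with a thread pool created through Executors from the java.util.concurrent package.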
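To make the java.util.concurrent suggestion concrete, here is a minimal sketch of the same spawn-and-wait pattern built on an ExecutorService. It is not how Nutch does it: the class name FetcherPoolSketch, the method fetchAll and the one-minute timeout are assumptions made for illustration, and the task body only counts pages instead of downloading anything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a fixed thread pool in place of hand-started FetcherThreads.
final class FetcherPoolSketch {

    // Submits one task per URL and returns how many tasks completed.
    static int fetchAll(List<String> urls, int threadCount) {
        AtomicInteger pages = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        for (String url : urls) {
            pool.submit(() -> {
                // A real task would download and store the page for this URL.
                pages.incrementAndGet();
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            // The one-minute limit is an arbitrary value chosen for this sketch.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        }
        return pages.get();
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            urls.add("http://example.com/page" + i);
        }
        System.out.println("pages fetched: " + fetchAll(urls, 10));
    }
}
```

With a pool, the polling over ThreadGroup.enumerate is no longer needed, because awaitTermination blocks until every submitted task has finished.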
Now let us see how the fetching thread, FetcherThread, works.
As with any thread, we trace it from its run method:
FetchListEntry fle = new FetchListEntry();
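This creates a fetch-list entry object. To stay focused, we will look at FetchListEntry and the data structures of its related classes later.
Then comes another while (true) loop; let us see what it does: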
[code]
if (fetchList.next(fle) == null)
  break;
url = fle.getPage().getURL().toString();
// the URL to fetch has now been taken from the current FetchListEntry
if (!fle.getFetch()) { // should we fetch this page?
  if (LOG.isLoggable(Level.FINE))
    LOG.fine("not fetching " + url);
  handleFetch(fle, new ProtocolOutput(null, ProtocolStatus.STATUS_NOTFETCHING));
  continue;
}
[/code]
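If the page should not be fetched, handleFetch records that outcome and the loop moves on to the next entry.
Next comes a do…while loop that follows redirects encountered during fetching, up to a fixed number of times.
The loop condition is refetch && (redirCnt < MAX_REDIRECT):
a refetch is required and the redirect count has not yet reached the maximum.
Inside the loop, the ProtocolFactory factory creates a protocol instance: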
Protocol protocol = ProtocolFactory.getProtocol(url);
Protocol implementations are provided as plugins; we skip their details for now.
The protocol returns the result of the fetch as a ProtocolOutput:
ProtocolOutput output = protocol.getProtocolOutput(fle);
From this output we can obtain the fetch status, ProtocolStatus, and the fetched content, Content:

[code]
ProtocolStatus pstat = output.getStatus();
Content content = output.getContent();
[/code]
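The code then branches on the fetch status:
switch(pstat.getCode())
On success, case ProtocolStatus.SUCCESS:
and if the content is not null, if (content != null),
the page and byte counters are updated, and after every 100 pages the fetch rate and related figures, computed from pages and bytes, are written to the log.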
[code]
synchronized (Fetcher.this) { // update status
  pages++;
  bytes += content.getContent().length;
  if ((pages % 100) == 0) { // show status every 100 pages
    status();
  }
}
[/code]
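The counter update above can be tried in isolation. The following is a small sketch under stated assumptions: FetchStatsSketch and its methods are hypothetical names, and the status line format is invented here rather than copied from Nutch's status() method. What it does keep from the original is the idea that all threads update shared counters under one lock and report every 100 pages.

```java
import java.util.Locale;

// Hypothetical sketch of shared fetch counters guarded by a single lock.
final class FetchStatsSketch {
    private final long startMillis;
    private long pages;
    private long bytes;
    private int reports; // how many status lines have been printed

    FetchStatsSketch(long startMillis) {
        this.startMillis = startMillis;
    }

    // Called by a worker after each successful fetch.
    synchronized void recordPage(int contentLength, long nowMillis) {
        pages++;
        bytes += contentLength;
        if (pages % 100 == 0) { // report every 100 pages
            reports++;
            System.out.println(status(nowMillis));
        }
    }

    // Formats the counters and the average rate since start.
    synchronized String status(long nowMillis) {
        double secs = Math.max(1L, nowMillis - startMillis) / 1000.0;
        return String.format(Locale.ROOT, "%d pages, %d bytes, %.1f pages/s",
                pages, bytes, pages / secs);
    }

    synchronized long pages() { return pages; }
    synchronized long bytes() { return bytes; }
    synchronized int reports() { return reports; }

    public static void main(String[] args) {
        FetchStatsSketch stats = new FetchStatsSketch(System.currentTimeMillis());
        for (int i = 0; i < 250; i++) {
            stats.recordPage(1024, System.currentTimeMillis());
        }
        System.out.println("final: " + stats.status(System.currentTimeMillis()));
    }
}
```

Passing the clock value in as a parameter is a choice made for this sketch so that the class is easy to test; the original reads the counters directly from the enclosing Fetcher.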
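The result is then passed to handleFetch:
ParseStatus ps = handleFetch(fle, output);
If the returned parse status is not null and reports a successful redirect:
if (ps != null && ps.getMinorCode() == ParseStatus.SUCCESS_REDIRECT)
the redirect target is read from it and passed through the URL filters: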
[code]
String newurl = ps.getMessage();
newurl = URLFilters.filter(newurl);
[/code]
If the redirect target newurl is not null and differs from the current url:
if (newurl != null && !newurl.equals(url))
a refetch is scheduled by updating refetch, url and redirCnt:
[code]
refetch = true;
url = newurl;
redirCnt++;
[/code]
and a FetchListEntry is created for the new URL:
fle = new FetchListEntry(true, new Page(url, NEW_INJECTED_PAGE_SCORE), new String[0]);
If the page has moved, permanently or temporarily:
case ProtocolStatus.MOVED: // try to redirect immediately
case ProtocolStatus.TEMP_MOVED: // try to redirect immediately
the fetcher redirects immediately.
First the fetch result is handled:
handleFetch(fle, output);
then the redirect URL is obtained:
[code]
String newurl = pstat.getMessage();
newurl = URLFilters.filter(newurl);
if (newurl != null && !newurl.equals(url)) {
  refetch = true;
  url = newurl;
  redirCnt++;
  // create new entry.
  fle = new FetchListEntry(true, new Page(url, NEW_INJECTED_PAGE_SCORE), new String[0]);
}
[/code]
The procedure mirrors the redirect handling above.
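Both redirect paths share the same bounded do…while shape, which can be sketched on its own. Everything here apart from the loop condition is an assumption: RedirectLoopSketch and follow are hypothetical names, the value 3 for MAX_REDIRECT is picked for the example, and a plain map from URL to redirect target stands in for the real protocol fetch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the bounded redirect loop.
final class RedirectLoopSketch {
    static final int MAX_REDIRECT = 3; // assumed value for this example

    // Follows redirects and returns the URL at which the loop stopped.
    // redirects maps a URL to its redirect target; no entry means "no redirect".
    static String follow(String startUrl, Map<String, String> redirects) {
        String url = startUrl;
        int redirCnt = 0;
        boolean refetch;
        do {
            refetch = false;
            String newUrl = redirects.get(url); // stand-in for fetching url
            if (newUrl != null && !newUrl.equals(url)) {
                refetch = true; // go round again with the new target
                url = newUrl;
                redirCnt++;
            }
        } while (refetch && redirCnt < MAX_REDIRECT);
        return url;
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://a.example/", "http://b.example/");
        redirects.put("http://b.example/", "http://c.example/");
        System.out.println(follow("http://a.example/", redirects));
    }
}
```

The !newUrl.equals(url) check stops a page that redirects to itself, and the counter stops longer cycles, so the loop always terminates.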
The following statuses:
[code]
case ProtocolStatus.GONE:
case ProtocolStatus.NOTFOUND:
case ProtocolStatus.ACCESS_DENIED:
case ProtocolStatus.ROBOTS_DENIED:
case ProtocolStatus.RETRY:
case ProtocolStatus.NOTMODIFIED:
[/code]
are passed straight to handleFetch(fle, output).
If an exception occurred, the error is logged and the result is then handed to handleFetch:
case ProtocolStatus.EXCEPTION:
logError(url, fle, new Exception(pstat.getMessage()));
handleFetch(fle, output);
Any other value is an unknown status: the status code is logged and the result is handed to handleFetch:
default:
LOG.warning("Unknown ProtocolStatus: " + pstat.getCode());
handleFetch(fle, output);
That ends the loop.
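The branches of the switch walked through above fall into a handful of groups, which the following sketch summarises. It is only a reading aid: the enums Status and Action and the method actionFor are hypothetical and are not Nutch's ProtocolStatus API; the grouping itself follows the cases described in the text.

```java
// Hypothetical summary of how the fetch status selects the follow-up action.
final class StatusDispatchSketch {
    enum Status {
        SUCCESS, MOVED, TEMP_MOVED, GONE, NOTFOUND, ACCESS_DENIED,
        ROBOTS_DENIED, RETRY, NOTMODIFIED, EXCEPTION, UNKNOWN
    }

    enum Action {
        STORE_AND_CHECK_PARSE_REDIRECT, // handleFetch, then inspect the ParseStatus
        STORE_AND_REDIRECT,             // handleFetch, then follow the new URL
        STORE_ONLY,                     // handleFetch only
        LOG_ERROR_AND_STORE,            // log the error, then handleFetch
        WARN_AND_STORE                  // warn about the code, then handleFetch
    }

    static Action actionFor(Status status) {
        switch (status) {
            case SUCCESS:
                return Action.STORE_AND_CHECK_PARSE_REDIRECT;
            case MOVED:
            case TEMP_MOVED:
                return Action.STORE_AND_REDIRECT;
            case GONE:
            case NOTFOUND:
            case ACCESS_DENIED:
            case ROBOTS_DENIED:
            case RETRY:
            case NOTMODIFIED:
                return Action.STORE_ONLY;
            case EXCEPTION:
                return Action.LOG_ERROR_AND_STORE;
            default:
                return Action.WARN_AND_STORE;
        }
    }

    public static void main(String[] args) {
        for (Status s : Status.values()) {
            System.out.println(s + " -> " + actionFor(s));
        }
    }
}
```

Every branch ends in handleFetch, so a record is written for each entry whatever the outcome of the fetch.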
Finally, once the number of finished threads equals threadCount, all plugins are shut down:
[code]
synchronized (Fetcher.this) {
  atCompletion++;
  if (atCompletion == threadCount) {
    try {
      PluginRepository.getInstance().finalize();
    } catch (java.lang.Throwable t) {
      // do nothing
    }
  }
}
[/code]
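The "last thread out cleans up" pattern can be reproduced in a few lines. In this sketch CompletionSketch, workerFinished and runWorkers are hypothetical names, and incrementing a counter stands in for shutting down the plugin repository.

```java
// Hypothetical sketch: the last worker to finish performs the shared cleanup.
final class CompletionSketch {
    private final int threadCount;
    private int atCompletion;
    private int cleanups; // how often the cleanup step ran

    CompletionSketch(int threadCount) {
        this.threadCount = threadCount;
    }

    // Called by each worker exactly once, at the end of its run.
    private synchronized void workerFinished() {
        atCompletion++;
        if (atCompletion == threadCount) {
            cleanups++; // stand-in for releasing shared resources
        }
    }

    synchronized int cleanups() {
        return cleanups;
    }

    // Starts threadCount workers, waits for them, and reports the cleanup count.
    static int runWorkers(int threadCount) {
        CompletionSketch shared = new CompletionSketch(threadCount);
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(shared::workerFinished, "fetcher" + i);
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop waiting if the caller was interrupted
            }
        }
        return shared.cleanups();
    }

    public static void main(String[] args) {
        System.out.println("cleanup ran " + runWorkers(8) + " time(s)");
    }
}
```

Because the increment and the comparison happen under the same lock, exactly one thread sees the counter reach threadCount, so the cleanup runs once.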
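As we have seen, most of the processing after a page is fetched is delegated to handleFetch.
Now let us look at the code of private ParseStatus handleFetch(FetchListEntry fle, ProtocolOutput output).
It first takes the content from output and the URL from fle: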
[code]
Content content = output.getContent();
MD5Hash hash = null;
String url = fle.getPage().getURL().toString();
[/code]
If content is null, an empty Content is created and the digest is computed over the URL; otherwise the digest is computed over the content:
[code]
if (content == null) {
  content = new Content(url, url, new byte[0], "", new Properties());
  hash = MD5Hash.digest(url);
} else {
  hash = MD5Hash.digest(content.getContent());
}
[/code]
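The fallback from content hash to URL hash is easy to reproduce with the JDK alone. The sketch below uses java.security.MessageDigest instead of Nutch's MD5Hash class; PageHashSketch, md5Hex and hashFor are hypothetical names, and encoding the URL as UTF-8 is an assumption of this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: hash the content when present, otherwise the URL.
final class PageHashSketch {

    // Returns the MD5 digest of data as 32 lowercase hex characters.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    // Mirrors the branch above: no content means the URL itself is hashed.
    static String hashFor(String url, byte[] content) {
        if (content == null) {
            return md5Hex(url.getBytes(StandardCharsets.UTF_8)); // assumed encoding
        }
        return md5Hex(content);
    }

    public static void main(String[] args) {
        System.out.println(hashFor("http://example.com/", null));
        System.out.println(hashFor("http://example.com/",
                "hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Hashing the URL when nothing was downloaded still gives every failed fetch a distinct hash, whereas hashing the empty byte array would give all of them the same one.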
Next the ProtocolStatus is retrieved:
ProtocolStatus protocolStatus = output.getStatus();
If the Fetcher is not configured to parse, the fetched page is written to disk as-is:
[code]
if (!Fetcher.this.parsing) {
  outputPage(new FetcherOutput(fle, hash, protocolStatus),
             content, null, null);
  return null;
}
[/code]
Otherwise the page is parsed.
First the page's contentType is obtained, so that a parser suited to that type can be chosen:
String contentType = content.getContentType();
The page is then parsed with a Parser:
[code]
Parser parser = null;
Parse parse = null;
ParseStatus status = null;
try {
  parser = ParserFactory.getParser(contentType, url);
  parse = parser.getParse(content);
  status = parse.getData().getStatus();
} catch (Exception e) {
  e.printStackTrace();
  status = new ParseStatus(e);
}
[/code]
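The pattern here is to look up a parser by content type and turn any failure into a status value instead of letting the exception escape. A minimal sketch of that pattern follows. None of the names come from Nutch: ParserLookupSketch, Result and the two toy parsers are assumptions, and stripping tags with a regular expression is only a placeholder for real HTML parsing.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: choose a parser by content type, report failure as a status.
final class ParserLookupSketch {

    // Outcome of a parse attempt: extracted text on success, a reason on failure.
    static final class Result {
        final boolean success;
        final String text;
        final String reason;

        Result(boolean success, String text, String reason) {
            this.success = success;
            this.text = text;
            this.reason = reason;
        }
    }

    private static final Map<String, Function<byte[], String>> PARSERS = new HashMap<>();
    static {
        PARSERS.put("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Placeholder "HTML parser": drops anything that looks like a tag.
        PARSERS.put("text/html",
                bytes -> new String(bytes, StandardCharsets.UTF_8).replaceAll("<[^>]*>", ""));
    }

    static Result parse(String contentType, byte[] content) {
        try {
            Function<byte[], String> parser = PARSERS.get(contentType);
            if (parser == null) {
                throw new IllegalArgumentException("no parser for " + contentType);
            }
            return new Result(true, parser.apply(content), "");
        } catch (RuntimeException e) {
            // Same idea as new ParseStatus(e): keep the failure as data.
            return new Result(false, "", String.valueOf(e.getMessage()));
        }
    }

    public static void main(String[] args) {
        Result ok = parse("text/html", "<b>hi</b>".getBytes(StandardCharsets.UTF_8));
        System.out.println(ok.success + " " + ok.text);
        Result bad = parse("application/x-unknown", new byte[0]);
        System.out.println(bad.success + " " + bad.reason);
    }
}
```

Keeping the failure as a value lets the caller store the page with an empty parse result rather than abandoning the entry.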
If parsing succeeded, if (status.isSuccess()),
the FetcherOutput is saved together with the content, the extracted text and the parse data:
[code]
outputPage(new FetcherOutput(fle, hash, protocolStatus),
           content, new ParseText(parse.getText()), parse.getData());
[/code]
Otherwise, in the else branch, the FetcherOutput is saved with empty parse text and a ParseData that carries only the failure status:
[code]
LOG.info("fetch okay, but can't parse " + url + ", reason: "
         + status.toString());
outputPage(new FetcherOutput(fle, hash, protocolStatus),
           content, new ParseText(""),
           new ParseData(status, "", new Outlink[0], new Properties()));
[/code]
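We skip the details of the Parser for now. Next time we will look at how web pages are downloaded over HTTP, that is, at the implementation of the Protocol plugin.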