Please credit the source when reposting this article: http://blog.youkuaiyun.com/pwlazy
Command "crawl": net.nutch.tools.CrawlTool
CrawlTool is a class that does little more than lash together the steps you'd do manually for a whole-web crawl. It consists of two simple static methods, plus a main(). Here is an outline of its operations:
- start logger: LogFormatter.getLogger(...)
- load "crawl-tool.xml" config file: NutchConf.addConfResource(...)
- read arguments from the command line
- create a new web db: WebDBAdminTool.main(...)
- add root URLs into the db: WebDBInjector.main(...)
- for 1 to depth (= 5 by default):
    - generate a new segment: FetchListTool.main(...)
    - fetch the segment: Fetcher.main(...)
    - update the db: UpdateDatabaseTool.main(...)
- comment: "Re-fetch everything to get complete set of incoming anchor texts"
- delete all old segment data: FileUtil.fullyDelete(...)
- make a single segment with all pages: FetchListTool.main(...)
- re-fetch everything: Fetcher.main(...)
- index: IndexSegment.main(...)
- dedup: DeleteDuplicates.main(...)
- merge: IndexMerger.main(...)
Translating this into the equivalent "nutch" script commands, we can see how similar this is to the whole-web crawling process:
- (start logger, etc.)
- bin/nutch admin db -create
- bin/nutch inject db ...
- (for 1 to depth:)
    - bin/nutch generate ...
    - bin/nutch fetch ...
    - bin/nutch updatedb ...
- (call net.nutch.FileUtil.fullyDelete(...))
- bin/nutch generate ...
- bin/nutch fetch ...
- bin/nutch index ...
- bin/nutch dedup ...
- bin/nutch merge ...
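To make this concrete, here is roughly what the sequence could look like as an actual shell script, for a depth of 2. This is a sketch, not CrawlTool's literal behavior: the "db"/"segments"/"index" directory names, the "-urlfile urls" flag, and the dedup/merge arguments follow old Nutch 0.x tutorial conventions and may differ in your version, so check each command's usage message before running it.

  #!/bin/sh
  # Sketch only: flags and directory layout assume Nutch 0.x tutorial
  # conventions; verify against your version's usage messages.

  bin/nutch admin db -create             # create a new web db
  bin/nutch inject db -urlfile urls      # seed the db with the root URLs

  for i in 1 2; do                       # "for 1 to depth", with depth=2 here
    bin/nutch generate db segments       # write a fetchlist into a new segment
    s=`ls -d segments/2* | tail -1`      # segment dirs are timestamped; take the newest
    bin/nutch fetch $s                   # fetch the segment's pages
    bin/nutch updatedb db $s             # fold the results back into the db
  done

  rm -rf segments/*                      # delete all old segment data
  bin/nutch generate db segments         # one segment holding all pages
  s=`ls -d segments/2* | tail -1`
  bin/nutch fetch $s                     # re-fetch everything
  bin/nutch index $s                     # index the segment
  bin/nutch dedup segments dedup.tmp     # delete duplicate documents
  bin/nutch merge index segments/*/index # merge everything into one index

Because generate names each segment directory with a timestamp, the script grabs the newest entry under segments/ after every generate call.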
If we wished to customize CrawlTool, we could easily copy its contents to another class, edit, compile, then run it via "bin/nutch" using its full class name. But, as you can see, there isn't much here to customize! The actual work of making HTTP requests occurs inside Fetcher.main().
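For instance, supposing the copy lives in a hypothetical class org.example.MyCrawlTool (the class name, jar name, and arguments below are all made up for illustration; only the "run a class by its full name" behavior comes from bin/nutch itself, as described above):

  # compile the customized copy against the Nutch classes
  javac -classpath nutch-0.7.jar -d build org/example/MyCrawlTool.java

  # once the compiled class is on Nutch's classpath, bin/nutch runs it
  # by full class name; the arguments here mirror the stock "crawl" command
  bin/nutch org.example.MyCrawlTool urls -dir crawl.test -depth 5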
Let's examine the steps that occur before Fetcher.main(...), then dive into the crawler itself.