Dissecting The Nutch Crawler - Command "crawl": net.nutch.tools.CrawlTool


Original English source: DissectingTheNutchCrawler
If you repost this article, please credit the source: http://blog.youkuaiyun.com/pwlazy

Command "crawl": net.nutch.tools.CrawlTool

CrawlTool is a class that does little more than lash together the steps you'd do manually for a whole-web crawl. It consists of two simple static methods, plus a main(). Here is an outline of its operations:

- start logger: LogFormatter.getLogger(...)
- load "crawl-tool.xml" config file: NutchConf.addConfResource(...)
- read arguments from command-line
- create a new web db: WebDBAdminTool.main(...)
- add rootURLs into the db: WebDBInjector.main(...)
- for 1 to depth (=5 by default):
  - generate a new segment: FetchListTool.main(...)
  - fetch the segment: Fetcher.main(...)
  - update the db: UpdateDatabaseTool.main(...)
- comment: "Re-fetch everything to get complete set of incoming anchor texts"
- delete all old segment data: FileUtil.fullyDelete(...)
- make a single segment with all pages: FetchListTool.main(...)
- re-fetch everything: Fetcher.main(...)
- index: IndexSegment.main(...)
- dedup: DeleteDuplicates.main(...)
- merge: IndexMerger.main(...)

Translating this into the equivalent "nutch" script commands, we can see how similar this is to the whole-web crawling process:

- (start logger, etc.)
- bin/nutch admin db -create
- bin/nutch inject db ...
- (for 1 to depth:)
  - bin/nutch generate ...
  - bin/nutch fetch ...
  - bin/nutch updatedb ...
- (call net.nutch.FileUtil.fullyDelete(...))
- bin/nutch generate ...
- bin/nutch index ...
- bin/nutch dedup ...
- bin/nutch merge ...
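
Strung together as one runnable script, the loop becomes explicit. The sketch below is patterned on the command list above, not on CrawlTool's actual source: the db/segments directory names, the urls seed file, the trick for locating the newest timestamp-named segment directory, and most per-command arguments are assumptions, so check each command's usage output (running a `bin/nutch` command with no arguments usually prints it) for your Nutch version.

```sh
#!/bin/sh
# Sketch of CrawlTool's sequence as a shell script. Directory names, the
# "urls" seed file, and most per-command arguments are assumptions.

DEPTH=5                             # CrawlTool's default crawl depth

bin/nutch admin db -create          # create a new web db
bin/nutch inject db -urlfile urls   # add the root URLs (argument assumed)

i=1
while [ $i -le $DEPTH ]; do
  bin/nutch generate db segments    # generate a fetchlist in a new segment
  s=`ls -d segments/2* | tail -1`   # newest timestamp-named segment (assumed layout)
  bin/nutch fetch $s                # fetch the segment
  bin/nutch updatedb db $s          # update the db with what was fetched
  i=`expr $i + 1`
done

# "Re-fetch everything to get complete set of incoming anchor texts"
rm -rf segments                     # stands in for FileUtil.fullyDelete(...)
bin/nutch generate db segments      # make a single segment with all pages
s=`ls -d segments/2* | tail -1`
bin/nutch fetch $s                  # re-fetch everything
bin/nutch index $s                  # index the segment
bin/nutch dedup segments            # delete duplicates (arguments assumed)
bin/nutch merge index segments/*    # merge into one index (arguments assumed)
```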

If we wished to customize CrawlTool, we could easily copy its contents to another class, edit, compile, then run it via "bin/nutch" using its full class name. But, as you can see, there isn't much here to customize! The actual work of making HTTP requests occurs inside Fetcher.main().
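
For instance, supposing the edited copy were saved as net.nutch.tools.MyCrawlTool (a hypothetical name), it could be launched the same way, since bin/nutch accepts a fully qualified class name in place of a built-in command:

```sh
# Hypothetical customized copy of CrawlTool; the arguments are illustrative.
bin/nutch net.nutch.tools.MyCrawlTool urls -depth 5
```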

Let's examine the steps that occur before Fetcher.main(...), then dive into the crawler itself.



Command "crawl" corresponds to net.nutch.tools.CrawlTool

CrawlTool does little more than lash together the manual steps of a whole-web crawl. It consists of two simple static methods plus a main(); here is an outline of what main() does:

LogFormatter.getLogger(...)      // start the logger
NutchConf.addConfResource(...)   // load "crawl-tool.xml"
...                              // read the command-line arguments
WebDBAdminTool.main(...)         // create a new web db
WebDBInjector.main(...)          // add the rootURLs to the web db
...                              // loop from 1 to the crawl depth you set (default 5):
    FetchListTool.main(...)      // generate a new segment
    Fetcher.main(...)            // fetch the segment
    UpdateDatabaseTool.main(...) // update the web db
...                              // re-fetch to get the complete set of incoming anchor texts
FileUtil.fullyDelete(...)        // delete all the old segment data
FetchListTool.main(...)          // make a single segment with all pages
Fetcher.main(...)                // re-fetch everything
IndexSegment.main(...)           // build the index
DeleteDuplicates.main(...)       // delete duplicates
IndexMerger.main(...)            // merge the indexes

Translating the steps above into the equivalent nutch script commands, we can see how similar this is to whole-web crawling:
- (start logger, etc.)
- bin/nutch admin db -create
- bin/nutch inject db ...
- (for 1 to depth:)
  - bin/nutch generate ...
  - bin/nutch fetch ...
  - bin/nutch updatedb ...
- (call net.nutch.FileUtil.fullyDelete(...))
- bin/nutch generate ...
- bin/nutch index ...
- bin/nutch dedup ...
- bin/nutch merge ...

If we wished to customize CrawlTool, we could simply copy its contents into another class, edit it, compile it, and then run it through "bin/nutch" with the full class name. But as you can see, there isn't much here to customize. The actual work of issuing HTTP requests happens inside Fetcher.main(). Let's first look at the steps that come before Fetcher.main(), and then dig into the crawler itself.


Note: My English is limited, so please point out anything I have mistranslated. Thank you.
