Nutch Crawler Deployment and Configuration

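This article describes how to install and configure the Nutch crawler on Windows through Cygwin. It covers environment setup, parameter configuration, and deployment to Tomcat, ending with a working crawl and search.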

1. Install Cygwin on Windows (instructions are easy to find online). A full installation is recommended.

2. Download the tar.gz package of Nutch 1.0 or later (the steps below use nutch-1.2).

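3. In the nutch-1.2 directory, create a folder named urls. Inside urls, create a text file with any name and add one line: http://lucene.apache.org/nutch/. This is the seed URL to crawl (the URL in the seed file, for example urls/nutch, must end with "/").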
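The seed file from step 3 can be created from the Cygwin shell. This is a minimal sketch, assuming the current directory is nutch-1.2; the file name seed.txt is an arbitrary choice, since any name works.

```shell
# Assumption: the current directory is nutch-1.2.
# "seed.txt" is a hypothetical file name; Nutch reads every file in urls/.
mkdir -p urls
printf '%s\n' 'http://lucene.apache.org/nutch/' > urls/seed.txt

# Show the result: one URL per line, ending with "/".
cat urls/seed.txt
```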

4. Open the conf directory under nutch-1.2, open crawl-urlfilter.txt, and find these two lines:

# accept hosts in MY.DOMAIN.NAME

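+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

The second line is a regular expression preceded by "+" (accept). The URLs you want to crawl must match it. Here I changed it to +^http://([a-z0-9]*\.)*apache.org/

To crawl every page regardless of host, you can simply use +^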
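Before crawling, it can help to check which URLs the accept pattern lets through. The sketch below uses grep -E as a stand-in for the regular expression engine Nutch uses internally; the two dialects agree on this simple pattern but are not identical in general. The function name classify_url and the example.com URL are illustrative assumptions.

```shell
# Accept pattern from crawl-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*apache.org/'

classify_url() {
  # Print "accept" when the URL in $1 matches the pattern, else "reject".
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo accept
  else
    echo reject
  fi
}

classify_url 'http://lucene.apache.org/nutch/'   # the seed URL from step 3
classify_url 'http://www.example.com/'           # a host outside apache.org
```

The first call prints accept and the second prints reject, because only the seed URL has a host ending in apache.org.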

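Next, edit nutch-site.xml in the conf directory. This file identifies the crawler to the sites it fetches from; Nutch will not run until it is configured.

By default the file looks like this: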
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

Here is an example after my changes:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>http.agent.name</name> 
<value>myfirsttest</value> 
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.

</description>
</property>

<property>
<name>http.agent.description</name>
<value>myfirsttest</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>

<property>
<name>http.agent.url</name>
<value>myfirsttest.com</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>

<property>
<name>http.agent.email</name>
<value>test@test.com</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>

</configuration>
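Because an empty http.agent.name stops the crawl, a quick check of the file can save a failed run. The following is a rough sketch only: it reads the XML line by line with awk rather than a real XML parser, and it assumes that <name> and <value> sit on separate lines as in the example above. The function name agent_name is hypothetical.

```shell
agent_name() {
  # Print the <value> that follows <name>http.agent.name</name> in file $1.
  # Assumes <name> and <value> are on separate lines.
  awk '/<name>http\.agent\.name<\/name>/ { found = 1; next }
       found && /<value>/ {
         gsub(/.*<value>|<\/value>.*/, "")
         print
         exit
       }' "$1"
}

# Assumption: the current directory is nutch-1.2.
conf=conf/nutch-site.xml
if [ -f "$conf" ]; then
  name=$(agent_name "$conf")
  if [ -n "$name" ]; then
    echo "http.agent.name is set to: $name"
  else
    echo "http.agent.name is missing or empty in $conf" >&2
  fi
fi
```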

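5. Open Cygwin and cd into the nutch-1.2 directory.

Run the command "bin/nutch crawl urls -dir crawled -depth 3 -topN 50 -threads 10"

or

run "bin/nutch crawl urls -dir crawled -depth 3 -topN 50 -threads 10 >& crawl.log" to capture the output in a log file. The options mean the following:

-dir: the directory that receives the crawl results; it must not exist before the crawl starts

-threads: the number of threads used for fetching

-depth: the crawl depth

-topN: the maximum number of URLs fetched at each level, taken from the top of the ranked list

crawl.log: the log file, which lets you follow the crawl as it runs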
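Since -dir must name a directory that does not exist yet, the crawl command can be wrapped in a small guard. This is a sketch under the assumption that it runs from the nutch-1.2 directory; the function name run_crawl is hypothetical, and the crawl flags are the ones listed above. It uses the portable "> crawl.log 2>&1" form, which has the same effect as ">& crawl.log" in bash.

```shell
run_crawl() {
  # $1: output directory for the crawl; it must not exist yet.
  if [ -e "$1" ]; then
    echo "error: $1 already exists; choose a new directory" >&2
    return 1
  fi
  # Same flags as the command above; stdout and stderr both go to crawl.log.
  bin/nutch crawl urls -dir "$1" -depth 3 -topN 50 -threads 10 > crawl.log 2>&1
}

# Only attempt the crawl when the Nutch launcher script is present.
if [ -x bin/nutch ]; then
  run_crawl crawled
fi
```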

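After the crawl finishes, a new crawled folder appears under nutch-1.2. It contains 5 folders:

① / ② crawldb / linkdb: the web link directories. They store the URLs and the link relationships between them, and serve as the basis for crawling and re-crawling. Pages expire after 30 days by default (this can be configured in nutch-site.xml, as discussed later).

③ segments: stores the fetched pages. Its contents depend on the depth setting above: with depth set to 2, two subfolders named after timestamps, such as "20061014163012", are created under segments. Each of these contains 6 subfolders: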

crawl_generate: names a set of urls to be fetched

crawl_fetch: contains the status of fetching each url

content: contains the content of each url

parse_text: contains the parsed text of each url

parse_data: contains outlinks and metadata parsed from each url

crawl_parse: contains the outlink urls, used to update the crawldb

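④ indexes: the index directory. When I ran the crawl, a folder named "part-00000" was created here.

⑤ index: the Lucene index directory (Nutch is built on Lucene; you can see lucene-core-1.9.1.jar under nutch-1.2/lib, and a brief introduction to the Luke tool comes at the end). This is the complete index produced by merging all the indexes under indexes. Note that the index only indexes the page content and does not store it, so a query has to read the segments directory to retrieve the page content.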
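The five folders listed above can be checked in one pass after a crawl. This is a minimal sketch that takes the folder names from the list; the function name check_crawl_output is hypothetical, and crawled is the -dir value used in step 5.

```shell
check_crawl_output() {
  # $1: the crawl output directory. Reports each expected subdirectory
  # and returns non-zero if any of them is missing.
  missing=0
  for sub in crawldb linkdb segments indexes index; do
    if [ -d "$1/$sub" ]; then
      echo "ok: $sub"
    else
      echo "missing: $sub"
      missing=1
    fi
  done
  return "$missing"
}

# Assumption: the current directory is nutch-1.2 and the crawl used -dir crawled.
if [ -d crawled ]; then
  check_crawl_output crawled || echo "crawl output is incomplete" >&2
fi
```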

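6. Run a simple test. In Cygwin, enter "bin/nutch org.apache.nutch.searcher.NutchBean apache", which calls the main method of NutchBean to search for the keyword "apache". Cygwin then prints the result, for example: Total hits: 29 (hits are comparable to the results of a JDBC query).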

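Note: if the search always returns 0 hits, try adding the following to nutch-site.xml under nutch-1.2/conf (the http.agent.name property described earlier must also be present; without it the search keeps returning 0):

<property>
<name>searcher.dir</name>
<value>D:/nutch/crawled</value>
</property>

searcher.dir specifies the path of the crawled directory generated earlier in Cygwin, that is, the directory holding the crawl results.

The crawl timing can be configured in the same file (as mentioned earlier, pages expire after 30 days by default). The property shown here, fetcher.max.crawl.delay, caps the robots.txt Crawl-Delay (in seconds) that the fetcher accepts before skipping a page; in Nutch 1.x the re-fetch interval itself is set by db.fetch.interval.default (in seconds).

<property>
<name>fetcher.max.crawl.delay</name>
</property>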

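7. Deploy Nutch in Tomcat. Copy nutch-1.2.war from the nutch-1.2 folder into Tomcat's webapps directory and start Tomcat, which automatically unpacks the WAR under Tomcat6.0/webapps. Tomcat names the unpacked folder after the WAR file, so rename the file to nutch.war first if you want the folder to be called nutch, as assumed here. Then edit /nutch/WEB-INF/classes/nutch-site.xml: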

Replace

<nutch-conf>

</nutch-conf>

with

<nutch-conf>
<property>
<name>http.agent.name</name>
</property>
<property>
<name>searcher.dir</name>
<value>Your_crawl_dir_path</value>
</property>
</nutch-conf>

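Your_crawl_dir_path is the folder where the crawled pages were saved earlier. The http.agent.name property also needs a <value>, as in the conf/nutch-site.xml example earlier. If the unpacked nutch-site.xml uses <configuration> as its root element, as that example does, keep that root and add the properties inside it.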
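The copy step at the start of step 7 can be sketched as a short shell function. It assumes that the environment variable CATALINA_HOME points at the Tomcat installation and that the WAR sits in the current directory; both are assumptions rather than details given above, and deploy_war is a hypothetical name. The WAR is copied under the name nutch.war so that Tomcat unpacks it to webapps/nutch.

```shell
deploy_war() {
  # $1: path to the Nutch WAR, $2: Tomcat webapps directory.
  # Tomcat names the unpacked folder after the WAR file name.
  cp "$1" "$2/nutch.war"
}

# Assumption: CATALINA_HOME is set and nutch-1.2.war is in the current directory.
if [ -n "${CATALINA_HOME:-}" ] && [ -f nutch-1.2.war ]; then
  deploy_war nutch-1.2.war "$CATALINA_HOME/webapps"
fi
```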