Configuring Nutch 1.0 on Windows XP

License: CC 4.0 BY-SA


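This article describes the prerequisites for configuring and running Nutch 1.0 on Windows XP: installing the JDK, Tomcat, and Cygwin, downloading and configuring Nutch, configuring the server, and fixing garbled Chinese text.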


Configuring and running Nutch 1.0 on XP

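Prerequisites:

1. Download JDK 1.6 from the Sun website and install it, then set the system environment variable JAVA_HOME to the JDK installation path.

2. Download Tomcat 6.0 from http://tomcat.apache.org/ and install it.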
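
Because the Nutch launch script relies on JAVA_HOME, it can help to verify the variable before going further. The following is a minimal sketch, assuming a POSIX shell such as the one Cygwin provides; the function name check_java_home is made up for illustration and is not part of Nutch or the JDK.

```shell
# Hypothetical helper: report whether JAVA_HOME is set and contains bin/java.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  # On Windows the executable is java.exe, so accept either name.
  if [ -x "$JAVA_HOME/bin/java" ] || [ -x "$JAVA_HOME/bin/java.exe" ]; then
    echo "JAVA_HOME looks valid: $JAVA_HOME"
  else
    echo "No java executable under: $JAVA_HOME/bin"
    return 1
  fi
}
```

Calling check_java_home once Cygwin is installed (section I) should print the JDK path if the variable was set correctly in the Windows system settings.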

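I. Download and install Cygwin

Download Cygwin from http://www.cygwin.com/ and install it.

On my machine, installing from a local package directory kept failing; if you run into the same problem, try the online installation instead.

Apart from choosing the installation directory, the default settings are fine.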

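II. Download and configure Nutch

1. Download Nutch 1.0 from http://www.apache.org/dyn/closer.cgi/lucene/nutch/

2. Unpack Nutch 1.0 and copy it into the Cygwin installation directory, renaming the directory to nutch (renaming is optional).

3. In the nutch directory, create a urls directory to hold the URLs to crawl, create url.txt inside it, and put the full URL where the crawl should start in that file.

4. Edit nutch\conf\nutch-site.xml and add the following between <configuration> and </configuration>: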

<property>
  <name>http.agent.name</name>
  <value>searcher</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

     http.robots.agents
     http.agent.description
     http.agent.url
     http.agent.email
     http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>windows</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://www.bitren.com/</value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>fiwiner@126.com</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

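5. Edit nutch\conf\crawl-urlfilter.txt

Find: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

Change it to: +^http://([a-z0-9]*\.)*

The rule must not start with #, because lines beginning with # in this file are comments and are ignored.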

6. Start Cygwin, change to the nutch directory (cd /nutch), and run the following command to start crawling:

bin/nutch crawl urls -dir crawled -depth 4 -threads 4 -topN 50 >&crawledlog.log



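urls: the directory that contains the list of seed URLs

-dir: the directory where the crawl results are stored

-depth: the crawl depth

-threads: the number of threads used for fetching

-topN: the maximum number of pages fetched at each depth level

>&crawledlog.log: redirects the output (stdout and stderr) to the log file crawledlog.log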
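
The input preparation from steps 3 and 5 can also be done from the Cygwin shell before the crawl is started. The sketch below is one possible way, assuming the stock crawl-urlfilter.txt whose accept rule ends in MY.DOMAIN.NAME/ and uses Unix line endings; the function names and the seed URL in the usage comment are placeholders and do not come from the original article.

```shell
# Step 3: write one seed URL into <nutch_dir>/urls/url.txt.
make_seed_list() {
  nutch_dir=$1
  seed_url=$2
  mkdir -p "$nutch_dir/urls"
  printf '%s\n' "$seed_url" > "$nutch_dir/urls/url.txt"
}

# Step 5: strip the trailing MY.DOMAIN.NAME/ from the accept rule.
# A backup of the original file is kept next to it with a .bak suffix.
open_url_filter() {
  filter_file=$1
  sed -i.bak 's|MY\.DOMAIN\.NAME/$||' "$filter_file"
}

# Example usage from the nutch directory (placeholder seed URL):
#   make_seed_list . 'http://www.example.com/'
#   open_url_filter conf/crawl-urlfilter.txt
```

Only lines that end in MY.DOMAIN.NAME/ are changed, so the explanatory comment line above the rule in the filter file is left alone.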

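III. Configure the server

1. Rename nutch-1.0.war in the nutch directory to nutch.war, copy it into Tomcat's webapps directory, and then start the server; Tomcat unpacks the WAR automatically. Go into the unpacked folder and edit WEB-INF\classes\nutch-site.xml:

Add the following between <configuration> and </configuration>. The part in bold is required, although many tutorials circulating online do not mention it: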

<property>
  <name>http.agent.name</name>
  <value>nutch</value>
  <description></description>
</property>
<!-- file properties -->
<property>
  <name>searcher.dir</name>
  <!-- Change the directory below to match your own setup -->
  <value>****\cygwin\nutch\crawled</value>
  <description></description>
</property>

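2. Fix garbled Chinese text

Edit tomcat\conf\server.xml, find the Connector element, and add the URIEncoding and useBodyEncodingForURI attributes: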

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"
           useBodyEncodingForURI="true" />
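
With URIEncoding="UTF-8", Tomcat decodes the percent-encoded bytes of the request URI as UTF-8. As a rough illustration of what a Chinese query looks like on the wire, the sketch below percent-encodes a string in the shell. The helper name urlencode and the search URL (search.jsp with a query parameter under the nutch webapp on localhost:8080) are assumptions made for this example and were not given in the original article.

```shell
# Hypothetical helper: percent-encode every byte of a string.
# Assumes the shell and this script use UTF-8.
urlencode() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/../%&/g' | tr 'a-f' 'A-F'
}

# Build a search URL for a Chinese query (assumed URL layout).
query=$(urlencode '中文')
echo "http://localhost:8080/nutch/search.jsp?query=$query"
```

The helper encodes every byte, including plain ASCII characters, which is more verbose than necessary but still a valid encoding. If the search page shows the query correctly when such a URL is opened, the Connector settings are taking effect.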