First Steps with Nutch

License: CC 4.0 BY-SA

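I have been studying Lucene lately, which led me to Nutch and, along the way, to Hadoop.

To use Nutch and Hadoop on Windows you need Cygwin. Plenty of blogs cover its installation, so I will not go into detail. The key point is to select the openssh package when you install it.

Here I will only describe how I got Nutch running.

First, my directory layout: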

cygwin/
    bin
    ...
hadoop/
    bin
    ...
nutch-0.9/
    bin
    ...
javaEEServer/
    tomcat6.0/
        bin
        ...

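First, go into nutch-0.9 and create a folder named urls. Inside that folder create a file named urls.txt and enter the sites you want to crawl; I use www.whieb.com as the example. Then go into nutch-0.9/conf and change nutch-site.xml as follows.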

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<!-- Before the change -->
<configuration>



</configuration>



============
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<!-- After the change -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-0.9</value>
    <description></description>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>my agent</value>
    <description></description>
  </property>

  <property>
    <name>http.agent.url</name>
    <value>http://www.whieb.com</value>
    <description></description>
  </property>

  <property>
    <name>http.agent.email</name>
    <value>esteem_84@163.com</value>
    <description></description>
  </property>
</configuration>
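
As far as I know, the Nutch 0.9 fetcher refuses to start when http.agent.name is empty, so it can be worth confirming which value actually ended up in the file. The following is only a rough sketch: get_property is a hypothetical helper of my own, not part of Nutch, and it assumes each <name> and <value> element sits on its own line, as in the listing above.

```shell
# Hypothetical helper (not part of Nutch): print the <value> that follows
# a given <name> in a Hadoop-style configuration file.
# Assumption: <name> and <value> are each on their own line.
get_property() {
  file=$1
  name=$2
  awk -v n="$name" '
    index($0, "<name>" n "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")      # strip everything up to the opening tag
      sub(/<\/value>.*/, "")    # strip the closing tag and what follows
      print
      exit
    }
  ' "$file"
}

# Only query the file if we are really inside the nutch-0.9 directory.
if [ -f conf/nutch-site.xml ]; then
  get_property conf/nutch-site.xml http.agent.name
fi
```

An empty result would suggest that the property is missing or that the file is laid out differently from what this sketch expects.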

Next, still in nutch-0.9/conf, edit crawl-urlfilter.txt so that it reads:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://www.whieb.com/

# skip everything else
-.
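
The comments in that file describe a first-match-wins scheme: rules are tried from top to bottom, the first pattern that matches decides, and a URL that matches nothing is ignored. To make that concrete, here is a small sketch of the same idea. It is not Nutch's implementation: url_allowed is a hypothetical function, the rule set is a shortened copy of the one above, and it uses grep's extended regular expressions rather than the Java regex engine that Nutch uses, so the more exotic patterns may behave differently.

```shell
# Sketch of first-match-wins URL filtering (NOT Nutch's own code).
# url_allowed RULES_FILE URL -> exit status 0 if accepted, 1 if rejected.
url_allowed() {
  rules=$1
  url=$2
  while IFS= read -r line; do
    case $line in
      ''|'#'*) continue ;;            # skip blank lines and comments
    esac
    sign=${line%"${line#?}"}          # first character: '+' or '-'
    pattern=${line#?}                 # the rest is the regular expression
    if printf '%s\n' "$url" | grep -Eq -e "$pattern"; then
      [ "$sign" = "+" ] && return 0   # first match decides
      return 1
    fi
  done < "$rules"
  return 1                            # no rule matched: URL is ignored
}

# Shortened rule set, written to a temporary file for the demonstration.
RULES=$(mktemp)
cat > "$RULES" <<'EOF'
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
-\.(gif|jpg|png|css|zip)$
-[?*!@=]
+^http://www.whieb.com/
-.
EOF

for u in \
  "http://www.whieb.com/index.html" \
  "http://www.whieb.com/logo.gif" \
  "http://www.whieb.com/search?q=nutch" \
  "http://www.example.com/"
do
  if url_allowed "$RULES" "$u"; then
    echo "ACCEPT $u"
  else
    echo "REJECT $u"
  fi
done
```

With these rules only the first URL is accepted: the image suffix, the query character and the foreign host are each caught by an earlier or later '-' rule.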

Then create a logs folder under nutch-0.9 to hold the log files.

At this point the directory should look like this:

f:\nutch-0.9
    bin
    conf
    ...
    logs
    urls
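
The urls and logs folders and the seed file can also be created from the Cygwin shell. This is just one way to do it, run from inside the nutch-0.9 directory. Writing the seed as a full URL with a trailing slash is my assumption, chosen so that it matches the +^http://www.whieb.com/ rule in the filter file.

```shell
# Run from the nutch-0.9 directory in a Cygwin shell.
# -p keeps mkdir quiet if the folders already exist.
mkdir -p urls logs

# One seed URL per line. The full "http://.../" form is an assumption,
# chosen so the seed matches the +^http://www.whieb.com/ filter rule.
printf '%s\n' 'http://www.whieb.com/' > urls/urls.txt

# Show what was written.
cat urls/urls.txt
```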

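Then start Cygwin, change to the nutch-0.9 directory, and run the crawl with the following command:

bin/nutch crawl urls -dir mydir -depth 4 -threads 4 -topN 50 >&log1.log

The arguments are:

urls: the directory that holds the file(s) listing the sites to crawl

-dir: the directory where the crawl results are stored

-depth: the crawl depth, that is, how many link levels to follow from the seed pages

-threads: the number of threads used for fetching

-topN: the maximum number of pages fetched at each depth level

>&log1.log: sends the command's output (both stdout and stderr) to the file log1.log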
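
The command above can be wrapped in a small function that checks a couple of preconditions first. Treat this as a sketch under assumptions: run_crawl is a hypothetical name, and the check on the output directory reflects my recollection that the crawl command will not write into a directory that already exists. The redirection `> log1.log 2>&1` is the portable spelling of `>&log1.log`.

```shell
# Hypothetical wrapper around the crawl command; run from nutch-0.9.
# run_crawl SEED_DIR OUTPUT_DIR
run_crawl() {
  seed_dir=$1
  out_dir=$2

  # The seed directory must exist before crawling.
  if [ ! -d "$seed_dir" ]; then
    echo "seed directory '$seed_dir' not found" >&2
    return 1
  fi

  # Assumption: the crawl tool expects a fresh output directory.
  if [ -e "$out_dir" ]; then
    echo "output directory '$out_dir' already exists" >&2
    return 1
  fi

  # Same options as in the text; stdout and stderr both go to log1.log.
  bin/nutch crawl "$seed_dir" -dir "$out_dir" \
    -depth 4 -threads 4 -topN 50 > log1.log 2>&1
}

# Usage (commented out so that nothing is crawled by accident):
# run_crawl urls mydir
```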

When the command finishes, a mydir directory appears under nutch-0.9; it contains the crawled data.
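
A quick sanity check on the result can save time before moving on to the web application. The sketch below assumes that a completed Nutch 0.9 crawl leaves the subdirectories crawldb, linkdb, segments, indexes and index inside the output directory; that list comes from my own memory of the layout rather than from anything stated above, and check_crawl_dir is a hypothetical helper.

```shell
# Hypothetical check: report which expected subdirectories are missing.
# The list of names is an assumption about the Nutch 0.9 crawl layout.
check_crawl_dir() {
  dir=$1
  missing=0
  for sub in crawldb linkdb segments indexes index; do
    if [ ! -d "$dir/$sub" ]; then
      echo "missing: $dir/$sub" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Only run the check if the crawl output is present.
if [ -d mydir ]; then
  check_crawl_dir mydir && echo "mydir looks complete"
fi
```

If something is reported missing, log1.log is the first place to look for the reason.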

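Next we deploy the Nutch web application. Copy nutch-0.9.war from the nutch-0.9 directory into Tomcat's webapps directory and start the server; Tomcat unpacks the WAR automatically. Go into the unpacked folder, then into WEB-INF/classes, and change nutch-site.xml to: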

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-0.9</value>
    <description></description>
  </property>
  <!-- file properties -->
  <property>
    <name>searcher.dir</name>
    <value>F:\\nutch-0.9\\mydir</value>
    <description></description>
  </property>
</configuration>

Then restart Tomcat and open the following address in a browser:

http://127.0.0.1:8080/nutch-0.9

You should see the Nutch search page.