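1. Download the required software and extract it
The versions used are:
(1) apache-nutch-2.3
(2) hadoop-1.2.1
(3) hbase-0.92.1
(4) solr-4.9.0
Extract them to /opt/jediael.
To get the latest development version of nutch instead, check it out from svn: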
svn co https://svn.apache.org/repos/asf/nutch/branches/2.x
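
The extraction step can be scripted. The following is a minimal sketch, assuming the release archives have already been downloaded into a single directory; the helper name extract_all and the $HOME/downloads path are hypothetical and not part of the original setup.

```shell
#!/bin/sh
# Hypothetical helper: extract every .tar.gz / .tgz archive found in
# SRC_DIR into DEST_DIR. Assumes the archives were downloaded beforehand.
extract_all() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    for archive in "$src_dir"/*.tar.gz "$src_dir"/*.tgz; do
        # When a glob matches nothing it stays literal; skip that case.
        [ -e "$archive" ] || continue
        tar -xzf "$archive" -C "$dest_dir" || return 1
    done
    return 0
}

# Usage (assumed download directory):
#   extract_all "$HOME/downloads" /opt/jediael
```

Both suffixes are handled because release archives are not always named the same way; writing to /opt usually requires root privileges.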
2. Install the hadoop 1.2.1 cluster environment
See http://blog.youkuaiyun.com/jediael_lu/article/details/38926477
3. Install the hbase 0.92.1 cluster environment
See http://blog.youkuaiyun.com/jediael_lu/article/details/43086641
4. Nutch configuration
(1) vi /usr/search/apache-nutch-2.3/conf/nutch-site.xml
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
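
In nutch-site.xml the two properties above have to sit inside the file's <configuration> root element. As a rough sketch, the whole file could be generated like this; write_nutch_site is a hypothetical helper, it overwrites the target file, and the XML declaration line is an assumption about the file header.

```shell
#!/bin/sh
# Hypothetical helper: write a complete nutch-site.xml holding the two
# properties from this step. WARNING: overwrites the target file.
write_nutch_site() {
    target=$1
    cat > "$target" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF
}

# Usage (path taken from this step):
#   write_nutch_site /usr/search/apache-nutch-2.3/conf/nutch-site.xml
```

If the file already carries other settings, edit it by hand instead, since this sketch replaces the entire file.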
(2) vi /usr/search/apache-nutch-2.3/ivy/ivy.xml
By default the following line is commented out; remove the comment markers so that it takes effect.
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
gora-hbase 0.5 corresponds to hbase 0.94.12.
Change the hadoop version as needed:
<dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.1" conf="*->default">
<dependency org="org.apache.hadoop" name="hadoop-test" rev="1.2.1" conf="test->default">
(3) vi /usr/search/apache-nutch-2.3/conf/gora.properties
Add the following line:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
The three steps above configure Nutch to use HBase for storage.
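
Before building, it can help to confirm that the edits are really in place. The sketch below is one possible check and not part of Nutch; check_hbase_backend is a hypothetical helper that inspects only nutch-site.xml and gora.properties. It does not check ivy.xml, because a plain grep cannot tell whether the gora-hbase dependency still sits inside a multi-line XML comment.

```shell
#!/bin/sh
# Hypothetical helper: verify that the Nutch conf directory selects
# HBaseStore in both nutch-site.xml and gora.properties.
check_hbase_backend() {
    conf_dir=$1
    store=org.apache.gora.hbase.store.HBaseStore
    # Fixed-string match on the <value> element in nutch-site.xml.
    if ! grep -Fq "<value>$store</value>" "$conf_dir/nutch-site.xml"; then
        echo "nutch-site.xml does not select $store" >&2
        return 1
    fi
    # Whole-line match, so a commented-out "#gora.datastore..." fails.
    if ! grep -Fxq "gora.datastore.default=$store" "$conf_dir/gora.properties"; then
        echo "gora.properties does not set gora.datastore.default" >&2
        return 1
    fi
    echo "HBase store configured in $conf_dir"
}

# Usage (path taken from the steps above):
#   check_hbase_backend /usr/search/apache-nutch-2.3/conf
```

The whole-line match is deliberately strict: trailing whitespace on the gora.properties line will also be reported as a failure.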
(4) Adjust the URL filter as needed
vi /usr/search/apache-nutch-2.3/conf/regex-urlfilter.txt
Change
# accept anything else
+.