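1. Install and build Nutch
Open /etc/profile and set ANT_HOME and PATH (note that the shell does not allow spaces around =):
export ANT_HOME=/usr/local/ant
export PATH=$PATH:$ANT_HOME/bin
Run ant -version to check the version information: Apache Ant(TM) version 1.9.2 compiled on July 8 2013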
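If ant -version reports "command not found", the usual cause is that $ANT_HOME/bin never made it onto PATH. The following is a minimal sketch of such a check; path_contains is a hypothetical helper name, not part of Ant, and the fallback /usr/local/ant simply mirrors the value used above.

```shell
# Return 0 if directory $1 appears as an entry of the colon-separated list $2.
# (path_contains is a hypothetical helper written for this post.)
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumption: ANT_HOME falls back to the location used in /etc/profile above.
ANT_HOME="${ANT_HOME:-/usr/local/ant}"
if path_contains "$ANT_HOME/bin" "$PATH"; then
  echo "ant bin directory is on PATH"
else
  echo "ant bin directory is NOT on PATH; re-read /etc/profile or log in again"
fi
```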
Edit ivy/ivy.xml and uncomment the gora-hbase dependency. Ivy manages the download of Nutch's dependencies.
<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
Edit conf/gora.properties so that HBase is the default backend store for Gora:
# Ensure that HBaseStore is set as the default datastore in gora.properties.
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
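A typo in this one line is easy to miss and only shows up later as a storage error, so it can be worth reading the value back. Below is a minimal sketch; get_prop is a hypothetical helper, and it only handles simple key=value lines, not the full Java properties syntax.

```shell
# Print the value of key $2 from the properties file $1.
# Assumptions: plain "key=value" lines, comments start with # or !,
# and the last assignment wins. (get_prop is a hypothetical helper.)
get_prop() {
  awk -v key="$2" '
    /^[ \t]*[#!]/ { next }                 # skip comment lines
    {
      pos = index($0, "=")
      if (pos == 0) next                   # not an assignment
      k = substr($0, 1, pos - 1)
      v = substr($0, pos + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)       # trim the key
      gsub(/^[ \t]+|[ \t]+$/, "", v)       # trim the value
      if (k == key) val = v
    }
    END { if (val != "") print val }
  ' "$1"
}

# Usage (run from the Nutch source directory):
#   get_prop conf/gora.properties gora.datastore.default
```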
Edit conf/nutch-site.xml:
<property>
  <name>http.agent.name</name>
  <value>My Spider Nutch</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>My Spider Nutch,*</value>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
<property>
  <name>generate.batch.id</name>
  <value>1000</value>
</property>
<property>
  <name>http.proxy.host</name>
  <value>www-proxy.jp.oracle.com</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>
<property>
  <name>http.proxy.port</name>
  <value>80</value>
  <description>The proxy port.</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/scratch/bigData/temp/hadoop_tmp</value>
  <description>hadoop root folder</description>
</property>
Edit build.xml so that Ant can download the dependencies through a proxy:
<!-- target: proxy-setting ========================================== -->
<target name="proxy-setting">
  <property name="proxy.host" value="www-proxy.jp.oracle.com" />
  <property name="proxy.port" value="80" />
  <setproxy proxyhost="${proxy.host}" proxyport="${proxy.port}" />
</target>

<!-- target: ivy-download ============================================ -->
<target name="ivy-download" description="To download ivy" unless="offline" depends="proxy-setting">
  <available file="${ivy.jar}" property="ivy.jar.found" />
  <antcall target="ivy-download-unchecked" />
</target>
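For comparison, the JVM's standard proxy system properties can also be passed to Ant through the ANT_OPTS environment variable, without touching build.xml. This is only a sketch of that alternative: proxy_opts is a hypothetical helper, and whether every download step of this particular build honors these properties is an assumption I have not verified.

```shell
# Build the JVM proxy flags for host $1 and port $2.
# (proxy_opts is a hypothetical helper; the -D names are the standard
# java.net proxy system properties.)
proxy_opts() {
  printf '%s\n' "-Dhttp.proxyHost=$1 -Dhttp.proxyPort=$2 -Dhttps.proxyHost=$1 -Dhttps.proxyPort=$2"
}

# Same proxy host and port as in the build.xml fragment above.
ANT_OPTS="$(proxy_opts www-proxy.jp.oracle.com 80)"
export ANT_OPTS
echo "ANT_OPTS=$ANT_OPTS"
```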
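As the last step, run Ant to download the dependencies from the network and compile. This takes about 30 minutes, and the build output is about 300 MB.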
ant >& build.log
Change to the nutch/runtime/local directory and run the following command to verify that Nutch was installed successfully:
bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
......
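If the usage message does not appear, a quick first check is whether the build produced the launcher script at all. A minimal sketch, assuming the nutch/runtime/local layout described above; nutch_runtime_ok and RUNTIME_DIR are hypothetical names.

```shell
# Return 0 if $1 looks like a built Nutch runtime, i.e. it contains an
# executable bin/nutch script. (nutch_runtime_ok is a hypothetical helper.)
nutch_runtime_ok() {
  [ -x "$1/bin/nutch" ]
}

# Assumption: the runtime lives at the path used in the text above.
RUNTIME_DIR="${RUNTIME_DIR:-nutch/runtime/local}"
if nutch_runtime_ok "$RUNTIME_DIR"; then
  echo "found $RUNTIME_DIR/bin/nutch"
else
  echo "no executable bin/nutch under $RUNTIME_DIR; look for errors in build.log"
fi
```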
2. Install HBase
Edit conf/hbase-env.sh: uncomment the JAVA_HOME line and point it at your JDK:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
Edit conf/hbase-site.xml:
<property>
  <name>hbase.rootdir</name>
  <value>file:///bigdata/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/bigdata/zookeeper</value>
</property>
Start and stop the HBase service:
bin/start-hbase.sh
bin/stop-hbase.sh
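After start-hbase.sh returns, it helps to confirm that the master process is actually alive before moving on. A minimal sketch, assuming the JDK's jps tool is on PATH and that the HBase master shows up under the name HMaster, as it does in a standalone setup like this one; has_hmaster is a hypothetical helper.

```shell
# Read `jps` output on stdin and succeed if an HMaster process is listed.
# (has_hmaster is a hypothetical helper; jps prints "<pid> <main class>".)
has_hmaster() {
  grep -q ' HMaster$'
}

# Only attempt the check when jps is available.
if command -v jps >/dev/null 2>&1; then
  if jps | has_hmaster; then
    echo "HBase master is running"
  else
    echo "HBase master is not running; check the files under logs/"
  fi
fi
```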
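3. Install Solr
Nutch's default Solr schema targets Solr 3.x, so schema.xml has to be replaced with the Solr 4.x version.
Copy the contents of schema-solr4.xml in nutch/runtime/local/conf into schema.xml in the same directory.
Then copy the contents of schema-solr4.xml in nutch/runtime/local/conf into $SOLR_HOME/example/solr/collection1/conf/schema.xml.
Note: the schema-solr4.xml shipped with Nutch 2.x is missing the _version_ field. Add the following line after the boost field under <!-- core fields -->: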
<field name="_version_" type="long" indexed="true" stored="true"/>
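The copy step and the _version_ check can be scripted so that a forgotten field is caught before Solr is started. This is only a sketch under the directory layout described above; install_schema is a hypothetical helper, and the .bak backup is my own addition rather than part of the original procedure.

```shell
# Copy Nutch's Solr 4 schema over schema.xml in a Solr core's conf directory.
#   $1: Nutch conf directory, $2: Solr core conf directory.
# Returns 1 if the source is missing or the copy fails, and 2 if the copied
# schema has no _version_ field. (install_schema is a hypothetical helper.)
install_schema() {
  src="$1/schema-solr4.xml"
  dst="$2/schema.xml"
  if [ ! -f "$src" ]; then
    echo "missing $src" >&2
    return 1
  fi
  # Keep a backup of any existing schema (an addition, not in the original steps).
  if [ -f "$dst" ]; then
    cp "$dst" "$dst.bak" || return 1
  fi
  cp "$src" "$dst" || return 1
  if ! grep -q 'name="_version_"' "$dst"; then
    echo "warning: $dst has no _version_ field; add it by hand" >&2
    return 2
  fi
}

# Usage:
#   install_schema nutch/runtime/local/conf "$SOLR_HOME/example/solr/collection1/conf"
```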
Change to the solr/example directory and start the Solr service:
java -jar start.jar
Open the Solr home page in a browser to verify that the installation succeeded.
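4. Integrating Nutch 2.2.1 + HBase 0.90.6 + Solr 4.4.0
Use the nutch script under nutch-2/runtime/local/bin/ to crawl web pages (the commands below are run from nutch-2/runtime/local).
All-in-one command: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Step-by-step commands: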
bin/nutch inject urls
bin/nutch generate -topN 3
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
bin/nutch generate -topN 3
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
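The step-by-step listing above is one inject followed by the same five commands repeated twice, which can be written as a loop. The following is a minimal sketch that reuses exactly those commands; crawl_rounds, run, and the NUTCH, SOLR_URL and DRY_RUN variables are hypothetical names introduced here, and DRY_RUN=1 only prints the commands instead of executing them.

```shell
# Assumptions: run from nutch-2/runtime/local, Solr at the URL used above.
NUTCH="${NUTCH:-bin/nutch}"
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8983/solr/}"

# Print a command, then execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# crawl_rounds ROUNDS TOPN: inject the seed URLs once, then repeat the
# generate/fetch/parse/updatedb/solrindex cycle ROUNDS times.
crawl_rounds() {
  rounds="$1"
  top_n="$2"
  run "$NUTCH" inject urls || return 1
  i=1
  while [ "$i" -le "$rounds" ]; do
    run "$NUTCH" generate -topN "$top_n" || return 1
    run "$NUTCH" fetch -all || return 1
    run "$NUTCH" parse -all || return 1
    run "$NUTCH" updatedb || return 1
    run "$NUTCH" solrindex "$SOLR_URL" -reindex || return 1
    i=$((i + 1))
  done
}

# Usage: preview two rounds without touching HBase or Solr.
#   DRY_RUN=1 crawl_rounds 2 3
```

Stopping at the first failing command keeps a broken fetch from being indexed; dropping the `|| return 1` guards would mimic running the commands by hand one after another.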
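This article describes how to integrate and deploy Nutch 2.2.1, HBase 0.90.6, and Solr 4.4.0: installing and configuring Nutch, configuring and starting HBase, installing Solr and adjusting its schema, and finally running the web crawl.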