Setting up a Nutch search engine

Install the software first, and set NUTCH_JAVA_HOME to the path of your Java installation.

 

Then get started.

 

In the nutch directory, create a urls.txt file listing the pages you want to crawl, one URL per line.
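As a minimal sketch, the seed file can be created from the shell like this. The URL http://www.example.com/ is only a placeholder, not a site from the original setup; replace it with the pages you actually want to crawl.

```shell
# Run from the nutch directory. One seed URL per line.
# http://www.example.com/ is a placeholder assumption.
cat > urls.txt <<'EOF'
http://www.example.com/
EOF
```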

 

I changed crawl-urlfilter.txt as follows:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*
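To see what this rule lets through, here is a rough check with grep -E, assuming the rule is meant to read `+^http://([a-z0-9]*\.)*` (a literal dot after each host label). Nutch itself evaluates the pattern with Java regular expressions, so grep is only an approximation, and the sample URLs are placeholders. Because the group may match zero times, the pattern accepts any URL that starts with http://, which is why removing MY.DOMAIN.NAME opens the crawl to all hosts.

```shell
# Approximate the accept rule with grep -E; the leading "+" in
# crawl-urlfilter.txt means "accept" and is not part of the regex.
accepts() {
  printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*'
}

# Placeholder URLs, for illustration only.
for url in http://www.example.com/ http://example.com/a.html \
           https://example.com/ ftp://example.com/; do
  if accepts "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
# accept http://www.example.com/
# accept http://example.com/a.html
# reject https://example.com/
# reject ftp://example.com/
```

Note that this only exercises the single rule; other rules earlier in crawl-urlfilter.txt may still reject a URL before this one is reached.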

 

Change nutch-site.xml as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Jennifer</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
    and set their values appropriately.
    </description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Jennifer</value>
    <description>Further description of our bot- this text is used in
    the User-Agent header. It appears in parenthesis after the agent name.
    </description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>Jennifer</value>
    <description>A URL to advertise in the User-Agent header. This will
    appear in parenthesis after the agent name. Custom dictates that this
    should be a URL of a page explaining the purpose and behavior of this
    crawler.
    </description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>Jennifer</value>
    <description>An email address to advertise in the HTTP 'From' request
    header and User-Agent header. A good practice is to mangle this
    address (e.g. 'info at example dot com') to avoid spamming.
    </description>
  </property>
</configuration>

 

In Cygwin, run: bin/nutch crawl urls.txt -dir /myDir -depth 3 >& crawl.log

 

This creates a myDir directory alongside the nutch directory, containing the crawl results. crawl.log is the log file, written to the nutch root directory.
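Before pointing the web application at the crawl results, it can help to confirm that the crawl actually produced its output. The sketch below is not part of the original setup: check_crawl_dir is a hypothetical helper, and the list of subdirectories is an assumption based on the layout that the crawl command in Nutch 0.8 to 1.x typically produces. Adjust it if your version differs.

```shell
# Report which expected subdirectories are missing from a crawl directory.
# The subdirectory names are an assumption (Nutch 0.8-1.x crawl layout).
check_crawl_dir() {
  dir=$1
  missing=0
  for sub in crawldb linkdb segments indexes index; do
    if [ ! -d "$dir/$sub" ]; then
      echo "missing: $dir/$sub"
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return $missing
}

# Usage sketch:
#   check_crawl_dir /myDir && echo "crawl output looks complete"
```

If something is reported missing, crawl.log is the first place to look for the reason.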

 

Then deploy the Nutch .war file from the nutch root directory to Tomcat, and in the deployed application edit nutch-site.xml under WEB-INF/classes as follows:

 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>E:/myDir</value>
  </property>
</configuration>

 

OK, that's it. Enjoy!
