Error when running the Crawl main method in Nutch

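This post describes a problem encountered while configuring the Apache Nutch crawler and how to fix it. The error is caused by the http.agent.name property not being set; a complete configuration example is given below.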


solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 4
depth = 5
solrUrl=null
topN = 10
Injector: starting at 2013-02-25 11:42:32
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-02-25 11:42:36, elapsed: 00:00:03
Generator: starting at 2013-02-25 11:42:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130225114239
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1395)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1280)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Generator: finished at 2013-02-25 11:42:41, elapsed: 00:00:04

Fetcher: No agents listed in 'http.agent.name' property.
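
The stack trace shows the exception being thrown from Fetcher.checkConfiguration before any page is fetched. A minimal sketch of that kind of check follows, assuming a plain Map stands in for the Nutch configuration object; the class AgentCheck and the method checkAgentName are hypothetical names for illustration, not Nutch's actual source.

```java
import java.util.HashMap;
import java.util.Map;

public class AgentCheck {
    // Hypothetical stand-in for the check: reject a missing or blank agent name.
    static void checkAgentName(Map<String, String> conf) {
        String agentName = conf.get("http.agent.name");
        if (agentName == null || agentName.trim().isEmpty()) {
            throw new IllegalArgumentException(
                "Fetcher: No agents listed in 'http.agent.name' property.");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("http.agent.name", "spider");
        checkAgentName(conf); // passes because the value is non-empty
        System.out.println("http.agent.name is set");
    }
}
```

With an empty map, the same call throws the IllegalArgumentException seen in the log above.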


Solution:

Add the following to nutch-site.xml:

<property>
 <name>http.agent.name</name>
 <value>spider</value>
 <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
 please set this to a single word uniquely related to your organization.
 
 NOTE: You should also check other related properties:
 
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
 
 and set their values appropriately.
 
 </description>
</property>
 
<property>
 <name>http.robots.agents</name>
 <value>spider,*</value>
 <description>The agent strings we'll look for in robots.txt files,
 comma-separated, in decreasing order of precedence. You should
 put the value of http.agent.name as the first agent name, and keep the
 default * at the end of the list. E.g.: BlurflDev,Blurfl,*
 </description>
</property>
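
The description of http.robots.agents asks for the value of http.agent.name to come first in the list. A small sketch of how one might verify that before starting a crawl is shown here; RobotsAgentsCheck and agentNameIsFirst are hypothetical helpers, not part of Nutch, and the exact-match comparison is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RobotsAgentsCheck {
    // Returns true when agentName is the first entry of the
    // comma-separated robotsAgents list (exact match after trimming).
    static boolean agentNameIsFirst(String agentName, String robotsAgents) {
        List<String> agents = Arrays.stream(robotsAgents.split(","))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toList());
        return !agents.isEmpty() && agents.get(0).equals(agentName.trim());
    }

    public static void main(String[] args) {
        // Values taken from the configuration above.
        System.out.println(agentNameIsFirst("spider", "spider,*"));
        // A list that puts the wildcard first does not satisfy the rule.
        System.out.println(agentNameIsFirst("spider", "*,spider"));
    }
}
```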

Properties set in nutch-site.xml override those in nutch-default.xml. Modifying nutch-default.xml directly is officially discouraged.
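
The override behaviour can be illustrated with a short sketch that reads two configuration documents and lets the site values win. It uses only the JDK's XML parser rather than Nutch or Hadoop classes; SiteConfigCheck, readProperties and merge are hypothetical names, and the inline XML strings (including the empty default value) are assumptions standing in for the real files.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SiteConfigCheck {
    // Collect <name>/<value> pairs from every <property> element.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration", e);
        }
    }

    // Site entries replace default entries that share the same name.
    static Map<String, String> merge(Map<String, String> defaults, Map<String, String> site) {
        Map<String, String> merged = new LinkedHashMap<>(defaults);
        merged.putAll(site);
        return merged;
    }

    public static void main(String[] args) {
        // Assumed contents: the default leaves the agent name empty, the site file sets it.
        String defaults = "<configuration><property><name>http.agent.name</name>"
            + "<value></value></property></configuration>";
        String site = "<configuration><property><name>http.agent.name</name>"
            + "<value>spider</value></property></configuration>";
        Map<String, String> effective = merge(readProperties(defaults), readProperties(site));
        System.out.println("http.agent.name = " + effective.get("http.agent.name"));
    }
}
```

Keeping local changes in nutch-site.xml means the shipped defaults stay untouched and can be compared against or replaced on upgrade.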
