Error when running the Crawl main method in Nutch

wangzhaodong001

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/wangzhaodong001/article/details/8608806

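This post describes a problem encountered while configuring the Apache Nutch crawler and how to resolve it. It focuses on the error raised when the http.agent.name property is not set, and gives a configuration example.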

solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 4
depth = 5
solrUrl=null
topN = 10
Injector: starting at 2013-02-25 11:42:32
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-02-25 11:42:36, elapsed: 00:00:03
Generator: starting at 2013-02-25 11:42:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130225114239
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
	at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1395)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1280)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Generator: finished at 2013-02-25 11:42:41, elapsed: 00:00:04

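The key line is "Fetcher: No agents listed in 'http.agent.name' property." As the stack trace shows, the Fetcher checks its configuration (Fetcher.checkConfiguration) before fetching and throws an IllegalArgumentException because http.agent.name has no value.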

Solution:

Add the following properties to nutch-site.xml, inside the <configuration> element:

<property>
<name>http.agent.name</name>
<value>spider</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.

</description>
</property>

<property>
<name>http.robots.agents</name>
<value>spider,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
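
Before starting another crawl, it can help to sanity-check the edited configuration. The following is only a minimal sketch using the Java standard library. It is not Nutch's own check. The class and method names (AgentConfigCheck, readProperties, findProblem) are hypothetical, and the rule that http.robots.agents should start with the agent name is taken from the property description above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: checks the two properties discussed above.
class AgentConfigCheck {

    // Collects <property><name>/<value> pairs from Hadoop-style configuration XML.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = childText(property, "name");
                if (!name.isEmpty()) {
                    props.put(name, childText(property, "value"));
                }
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not valid configuration XML", e);
        }
    }

    // Trimmed text of the first child element with the given tag, or "" if absent.
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }

    // Returns null when the settings look fine, otherwise a short problem description.
    static String findProblem(Map<String, String> props) {
        String agent = props.getOrDefault("http.agent.name", "");
        if (agent.isEmpty()) {
            return "http.agent.name is empty";
        }
        String robots = props.getOrDefault("http.robots.agents", "");
        // Assumption: an unset http.robots.agents is left alone; only a set value is checked.
        if (!robots.isEmpty()) {
            String first = robots.split(",", -1)[0].trim();
            if (!first.equals(agent)) {
                return "http.robots.agents should list '" + agent + "' first";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Inline sample mirroring the properties above (descriptions omitted).
        String xml = "<configuration>"
                + "<property><name>http.agent.name</name><value>spider</value></property>"
                + "<property><name>http.robots.agents</name><value>spider,*</value></property>"
                + "</configuration>";
        String problem = findProblem(readProperties(xml));
        System.out.println(problem == null ? "OK" : problem);
    }
}
```

With the sample values above, the program prints OK. To check a real file, its contents could be read into a string and passed to readProperties.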