crawl started in: crawl
rootUrlDir = urls
threads = 4
depth = 5
solrUrl=null
topN = 10
Injector: starting at 2013-02-25 11:42:32
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-02-25 11:42:36, elapsed: 00:00:03
Generator: starting at 2013-02-25 11:42:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130225114239
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1395)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1280)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Generator: finished at 2013-02-25 11:42:41, elapsed: 00:00:04
Fetcher: No agents listed in 'http.agent.name' property.
解决方法:
nutch-site.xml配置中添加:
<property>
<name>http.agent.name</name>
<value>spider</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>spider,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
nutch-site.xml会覆盖掉nutch-default.xml文件。官方不建议修改nutch-default.xml文件