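I ran into quite a few problems while installing and using Nutch. My platform is 32-bit Ubuntu 12.04, and the Nutch environment is jdk1.8.0_121, hbase0.98.8, and solr4.10.3.
Reference blogs:
1. http://blog.youkuaiyun.com/freedomboy319/article/details/44172277
2. http://blog.youkuaiyun.com/a973893384/article/details/49666063
The installation now basically works, but some problems still show up during crawling: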
IndexingJob: done.
SOLR dedup -> http://localhost:8983/solr
~/lab1/NUTCH_HOME/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://localhost:8983/solr
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:~/lab1/NUTCH_HOME/runtime/local/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:~/lab1/NUTCH_HOME/runtime/local/lib/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local365318350_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:383)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:393)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:403)
Error running:
~/lab1/NUTCH_HOME/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://localhost:8983/solr
Failed with exit value 1.
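After some searching I found that the cause was conflicting SLF4J files: runtime/local/lib contained two bindings, slf4j-log4j12-1.6.4.jar and slf4j-log4j12-1.6.6.jar. Deleting one of them solved that problem, and data can now be crawled normally.
However, the index still cannot be built. The run keeps failing at the same place (the solrdedup step), so this still needs work: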
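The cleanup described above can be sketched as a small shell helper. This is only a sketch under a few assumptions: the function names list_slf4j_bindings and set_aside_extra_bindings are made up for illustration, keeping the higher-versioned jar is my own choice (either one can be removed), and the extra jar is moved to a backup directory instead of being deleted so the change is easy to undo. It relies on GNU `sort -V`, which is available on Ubuntu.

```shell
#!/bin/sh
# Hypothetical helper: list the slf4j-log4j12 binding jars in a lib directory,
# sorted by version (lowest first). Requires GNU sort for the -V option.
list_slf4j_bindings() {
    # $1: lib directory, e.g. ~/lab1/NUTCH_HOME/runtime/local/lib
    ls "$1"/slf4j-log4j12-*.jar 2>/dev/null | sort -V
}

# Hypothetical helper: keep the highest-versioned binding and move the
# others into a backup directory (assumption: keeping the newest is fine).
set_aside_extra_bindings() {
    lib_dir=$1      # directory that contains the conflicting jars
    backup_dir=$2   # where the extra jars are moved to
    mkdir -p "$backup_dir"
    # sed '$d' drops the last line, i.e. the highest version stays in place.
    list_slf4j_bindings "$lib_dir" | sed '$d' | while IFS= read -r jar; do
        mv "$jar" "$backup_dir"/
    done
}
```

With my layout this would be called as `set_aside_extra_bindings ~/lab1/NUTCH_HOME/runtime/local/lib ~/lab1/slf4j-backup` (the backup path is just an example). Afterwards `list_slf4j_bindings` should print a single jar, and the "multiple SLF4J bindings" warning should no longer appear.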
SOLR dedup -> http://localhost:8983/solr/
/home/silvia/lab1/NUTCH_HOME/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://localhost:8983/solr/
Exception in thread "main" java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local2020123009_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:383)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:393)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:403)
Error running:
/home/silvia/lab1/NUTCH_HOME/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://localhost:8983/solr/
Failed with exit value 1.
To be updated...