Basic Nutch Commands

1: Export the data Nutch stored in HBase to text files: ./nutch readdb -dump /data/nutch_db/1108 -crawlId TestCrawl -content
This runs a MapReduce job. /data/nutch_db/1108 is the job's output path, and TestCrawl is the prefix of the HBase table name.
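The link between -crawlId and the HBase table name can be sketched in shell. This assumes the default Gora mapping, where the base table is called webpage and Nutch 2.x prefixes it with the crawl ID and an underscore. The helper name webpage_table is hypothetical and not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): derive the HBase table name
# that Nutch 2.x would use for a given crawl ID.
# Assumption: the base schema name is the default "webpage".
webpage_table() {
    crawl_id="$1"
    if [ -z "$crawl_id" ]; then
        # No -crawlId given: the table name carries no prefix.
        printf '%s\n' "webpage"
    else
        printf '%s\n' "${crawl_id}_webpage"
    fi
}
```

For the command above, webpage_table TestCrawl prints TestCrawl_webpage, which is the table that readdb reads.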
2: Commands available through ./nutch:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
3: Run a complete crawl with the crawl script: ./crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
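As a rough illustration of how the four arguments fit together, the sketch below checks them and prints the resulting command line instead of running it. The function name build_crawl_cmd is hypothetical, and the values in the usage note are reused from the other examples in these notes.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): check the four arguments of the
# crawl script and print the command line without executing it.
build_crawl_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: build_crawl_cmd <seedDir> <crawlID> <solrURL> <numberOfRounds>" >&2
        return 1
    fi
    # numberOfRounds must consist of digits only.
    case "$4" in
        ''|*[!0-9]*)
            echo "numberOfRounds must be a whole number" >&2
            return 1
            ;;
    esac
    printf './crawl %s %s %s %s\n' "$1" "$2" "$3" "$4"
}
```

For example, build_crawl_cmd /nutch/urls 7day http://192.168.4.129:8983/solr/ 2 prints a command that would crawl the seeds in /nutch/urls for two rounds under the crawl ID 7day.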
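4: Running the crawl steps one at a time:
    ./nutch inject /nutch/urls/seed.txt -crawlId 7day
    (inject writes the seed URLs into the HBase table <crawlId>_webpage; with a crawlId of bbs, for example, that is bbs_webpage)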
    ./nutch generate -topN 5 -crawlId 7day
    ./nutch fetch -all -crawlId 7day -threads 20
    ./nutch parse -all -crawlId 7day
    ./nutch updatedb -all -crawlId 7day
    ./nutch index -D solr.server.url=http://192.168.4.129:8983/solr/ -all -crawlId 7day
    ./nutch readdb -crawlId 7day_beijing -dump /home/nutch_output/beijing_1/
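The individual steps can be chained into a simple loop. The following is a minimal sketch, not the logic of the bundled crawl script: it injects once, then repeats generate, fetch, parse and updatedb for a given number of rounds, using the same options as the commands above. The NUTCH and DRY_RUN variables and the function names run and crawl_rounds are assumptions introduced for this example.

```shell
#!/bin/sh
# NUTCH and DRY_RUN are hypothetical knobs introduced for this sketch.
NUTCH="${NUTCH:-./nutch}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# crawl_rounds <seedFile> <crawlId> <rounds>
# Inject the seeds once, then repeat the generate/fetch/parse/updatedb cycle.
crawl_rounds() {
    seed="$1"
    crawl_id="$2"
    rounds="$3"
    run "$NUTCH" inject "$seed" -crawlId "$crawl_id" || return 1
    i=1
    while [ "$i" -le "$rounds" ]; do
        run "$NUTCH" generate -topN 5 -crawlId "$crawl_id" || return 1
        run "$NUTCH" fetch -all -crawlId "$crawl_id" -threads 20 || return 1
        run "$NUTCH" parse -all -crawlId "$crawl_id" || return 1
        run "$NUTCH" updatedb -all -crawlId "$crawl_id" || return 1
        i=$((i + 1))
    done
}
```

Setting DRY_RUN=1 before calling crawl_rounds /nutch/urls/seed.txt 7day 2 prints the nine commands without running them, which is a cheap way to review the sequence first. Stopping at the first failed step avoids updating the table from an incomplete fetch or parse.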
5: gora-hbase-mapping.xml: this file defines the column families and the meaning of each column.
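To see which column families a mapping file declares, something like the sketch below may help. It assumes the families appear as <family name="..."> elements and that grep supports the -o option (GNU and BSD grep both do). The function name list_families is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch or Gora): print the column family
# names declared in a Gora HBase mapping file, one per line.
list_families() {
    mapping_file="$1"
    # Keep only the <family ... name="..."> fragments, then strip everything
    # except the value of the name attribute.
    grep -o '<family[^>]*name="[^"]*"' "$mapping_file" |
        sed 's/.*name="\([^"]*\)"/\1/'
}
```

Calling list_families with the path to gora-hbase-mapping.xml prints the short family names, which are the same names that show up as column prefixes when the table is inspected in the HBase shell.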