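Inspecting the SequenceFiles Nutch Generates After a Crawl, in Code

This article presents a Java example that reads a Nutch SequenceFile. The code shows how to read crawl data from a SequenceFile generated by Nutch and print key fields such as the URL and fetch status.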

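The data file must be read with the class that matches its value type. (Here the data file has been copied to the root of the D: drive on a local Windows machine.)

Code: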
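When it is not obvious which value class a given data file needs, the file itself records the answer: a SequenceFile begins with the magic bytes `SEQ`, a version byte, and then the key and value class names. The sketch below reads just that header using only the JDK. The class name `SeqFileHeader` and its methods are made up for this illustration, and the parsing assumes header version 4 or later, where class names are stored as a variable-length integer followed by UTF-8 bytes.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative helper (not part of Hadoop or Nutch): peeks at a SequenceFile header. */
final class SeqFileHeader {

    /** Returns {keyClassName, valueClassName} as recorded in the header. */
    static String[] readKeyAndValueClass(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("not a SequenceFile: missing SEQ magic");
        }
        int version = in.readByte();
        if (version < 4) {
            // Assumption: older headers encode class names differently; not handled here.
            throw new IOException("unsupported SequenceFile version " + version);
        }
        return new String[] { readString(in), readString(in) };
    }

    /** Reads a length-prefixed UTF-8 string (variable-length int, then the bytes). */
    private static String readString(DataInputStream in) throws IOException {
        long length = readVLong(in);
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IOException("invalid string length " + length);
        }
        byte[] bytes = new byte[(int) length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Decodes Hadoop's variable-length integer encoding. */
    private static long readVLong(DataInputStream in) throws IOException {
        byte first = in.readByte();
        if (first >= -112) {
            return first; // values from -112 to 127 fit in the first byte
        }
        boolean negative = first < -120;
        int extraBytes = negative ? -120 - first : -112 - first;
        long value = 0;
        for (int i = 0; i < extraBytes; i++) {
            value = (value << 8) | (in.readByte() & 0xFF);
        }
        return negative ? ~value : value;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the location used in this article.
        String path = args.length > 0 ? args[0] : "D:\\data";
        try (InputStream in = new FileInputStream(path)) {
            String[] names = readKeyAndValueClass(in);
            System.out.println("key class:   " + names[0]);
            System.out.println("value class: " + names[1]);
        }
    }
}
```

With Hadoop on the classpath the same information is available from an open `SequenceFile.Reader` through `getKeyClassName()` and `getValueClassName()`, so the manual parsing is only useful as a dependency-free check.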

package cn.summerchill.nutch;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;

/**
 * Reads a SequenceFile generated by Nutch.
 * @author Administrator
 */
public class SeFileReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path dataPath = new Path("D:\\data");
        FileSystem fs = dataPath.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, dataPath, conf);
        try {
            Text key = new Text();
            // The value object must match the value type stored in the file.
            // Swap in one of the commented-out lines for other kinds of data files.
            CrawlDatum value = new CrawlDatum();
            //Content value = new Content();
            //Inlinks value = new Inlinks();
            //ParseText value = new ParseText();
            //ParseData value = new ParseData();
            while (reader.next(key, value)) {
                // Write key and value to the same stream so each record prints in order.
                System.out.println("key->\n" + key);
                System.out.println("value->\n" + value);
                System.out.println("=======================================");
            }
        } finally {
            // Close the reader even if reading a record fails.
            reader.close();
        }
    }
}
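The commented-out alternatives in the listing correspond to different parts of a Nutch 1.x crawl directory. As a rough guide, the sketch below maps a directory name found in a file's path to the value class usually stored there. `NutchValueTypes` and `valueClassFor` are hypothetical names for this illustration, and the table reflects the usual Nutch 1.x layout rather than anything read from the file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative lookup (not a Nutch API): directory name -> usual value class. */
final class NutchValueTypes {

    private static final Map<String, String> BY_DIRECTORY = new LinkedHashMap<>();
    static {
        BY_DIRECTORY.put("crawldb", "org.apache.nutch.crawl.CrawlDatum");
        BY_DIRECTORY.put("linkdb", "org.apache.nutch.crawl.Inlinks");
        BY_DIRECTORY.put("content", "org.apache.nutch.protocol.Content");
        BY_DIRECTORY.put("parse_text", "org.apache.nutch.parse.ParseText");
        BY_DIRECTORY.put("parse_data", "org.apache.nutch.parse.ParseData");
        BY_DIRECTORY.put("crawl_fetch", "org.apache.nutch.crawl.CrawlDatum");
        BY_DIRECTORY.put("crawl_generate", "org.apache.nutch.crawl.CrawlDatum");
        BY_DIRECTORY.put("crawl_parse", "org.apache.nutch.crawl.CrawlDatum");
    }

    /** Returns the likely value class for a path, or null if no known directory appears in it. */
    static String valueClassFor(String path) {
        // Accept both '/' and '\' as separators so Windows paths work too.
        for (String part : path.split("[/\\\\]")) {
            String cls = BY_DIRECTORY.get(part);
            if (cls != null) {
                return cls;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(valueClassFor("crawl/crawldb/current/part-00000/data"));
        System.out.println(valueClassFor("crawl/segments/20161009083135/parse_text/part-00000/data"));
        System.out.println(valueClassFor("D:\\data"));
    }
}
```

For a file copied out of its original directory, as with `D:\data` here, the lookup returns null because the path no longer carries that context.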

Output:

key->
http://bbs.superwu.cn/
value->
Version: 7
Status: 2 (db_fetched)
Fetch time: Tue Nov 08 08:31:30 CST 2016
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.6153846
Signature: 22defcd7cb4e7b1dc8a16a0a2f339ecb
Metadata: 
     Content-Type=application/xhtml+xml
    _pst_=success(1), lastModified=0
    _rs_=610

=======================================
key->
http://bbs.superwu.cn/archiver/
value->
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Oct 09 08:31:35 CST 2016
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.23076925
Signature: null
Metadata: 
 
=======================================
key->
http://bbs.superwu.cn/forum.php
value->
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Oct 09 08:31:35 CST 2016
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.15384616
Signature: null
Metadata: 
 
=======================================
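The time fields in this output are easier to read with a little arithmetic. The sketch below uses only `java.time`; it assumes that "CST" in the output means China Standard Time (`Asia/Shanghai`), and the class and method names are made up for illustration. It checks three things: a retry interval of 2592000 seconds is 30 days, the "Modified time" shown is simply the Unix epoch (0) rendered in that zone, which suggests no modification time was recorded, and the fetched record's "Fetch time" minus one retry interval falls on Oct 09, 2016, the same morning as the unfetched records. That last point would fit the field holding the next scheduled fetch for a fetched page, though this output alone cannot confirm it.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

/** Illustrative arithmetic on the values printed above (not a Nutch API). */
final class CrawlDatumTimes {

    // Assumption: "CST" in the output is China Standard Time.
    static final ZoneId CST = ZoneId.of("Asia/Shanghai");

    /** Converts a retry interval in seconds to whole days. */
    static long retryIntervalDays(long seconds) {
        return Duration.ofSeconds(seconds).toDays();
    }

    /** The Unix epoch (timestamp 0) as it appears in the assumed zone. */
    static ZonedDateTime epochInCst() {
        return Instant.EPOCH.atZone(CST);
    }

    /** Steps a fetch time back by one retry interval. */
    static ZonedDateTime minusInterval(ZonedDateTime fetchTime, long intervalSeconds) {
        return fetchTime.minusSeconds(intervalSeconds);
    }

    public static void main(String[] args) {
        long interval = 2592000L;
        ZonedDateTime fetchTime = ZonedDateTime.of(2016, 11, 8, 8, 31, 30, 0, CST);
        System.out.println("retry interval in days: " + retryIntervalDays(interval));
        System.out.println("epoch in CST:           " + epochInCst());
        System.out.println("fetch time - interval:  " + minusInterval(fetchTime, interval));
    }
}
```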

 
