Reading the Parse_Text Content from the Segments of a Nutch Crawl

jinming

Licensed under CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/jin861625788/article/details/8779457

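This article describes a way to iterate over the segment files of the Nutch search engine using the NutchBean class, and shows how constructing a Hit yields the HitDetails and ParseText needed to retrieve and inspect parsed page content.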

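Sometimes you really need to see how Nutch parsed the HTML of the pages it crawled. The usual approach is to run the dump or readseg command in a terminal, but that is not very convenient. The main method of the NutchBean class in the org.apache.nutch.searcher package offered a lead. Its code is as follows: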

  public static void main(String[] args) throws Exception {
    final String usage = "NutchBean query [<searcher.dir>]";

    if (args.length == 0) {
      System.err.println(usage);
      System.exit(-1);
    }
    // String[] str={"中国","/home/hadoop/output"};
    // args=str;

    final Configuration conf = NutchConfiguration.create();
    if (args.length > 1) {
      conf.set("searcher.dir", args[1]);
    }
    final NutchBean bean = new NutchBean(conf);
    try {
      final Query query = Query.parse(args[0], conf);
      query.getParams().setMaxHitsPerDup(0);
      final Hits hits = bean.search(query);
      System.out.println("Total hits: " + hits.getTotal());
      final int length = (int)Math.min(hits.getLength(), 10);
      final Hit[] show = hits.getHits(0, length);
      final HitDetails[] details = bean.getDetails(show);
      final Summary[] summaries = bean.getSummary(details, query);

      for (int i = 0; i < hits.getLength(); i++) {
        System.out.println(" " + i + " " + details[i] + "\n" + summaries[i]);
      }
    } catch (Throwable t) {
       LOG.error("Exception occured while executing search: " + t, t);
       System.exit(1);
    }
    System.exit(0);
  }

This code takes two arguments, the query term and the index path, and prints summaries of the search results.

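A note on the HitDetails class used here: it mainly has two member variables, private String[] fields; and private String[] values;. The two arrays have the same length; fields holds the field names and values holds the corresponding values, much like the key-value pairs of a HashMap. Here details holds the segment name (a folder named after a timestamp) and the URL of the document being looked up, which together locate the page. With a few changes, the code above can be made to iterate, if somewhat crudely, over the parse_text in the segments: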

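import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.util.NutchConfiguration;

public static void main(String[] args) throws IOException {
    // conf was not defined in the original snippet; it is created here the same way
    // NutchBean.main does. The searcher.dir value is an assumption based on the paths used here.
    final Configuration conf = NutchConfiguration.create();
    conf.set("searcher.dir", "/home/hadoop/output");
    final NutchBean bean = new NutchBean(conf);

    // Open the Lucene index only to find out how many documents it holds.
    FSDirectory dir = FSDirectory.open(new File("/home/hadoop/output/index"));
    IndexSearcher searcher = new IndexSearcher(dir);
    int num = searcher.maxDoc();
    searcher.close();

    int count = 0;
    for (int i = 0; i < num; i++) {
        Hit h = new Hit(0, i + "");  // index number 0, document number i as the key
        HitDetails details = bean.getDetails(h);
        ParseText text = bean.getParseText(details);
        // Drop everything up to the first "/" (the segment name), keeping the URL.
        String temp = details.toString();
        temp = temp.substring(temp.indexOf("/") + 1);
        // Just an output file: one per document, named by a running counter.
        FileOutputStream fos = new FileOutputStream(new File("/home/hadoop/temp/output/" + (++count)));
        try {
            fos.write((temp + "\r\n" + text.getText() + "\r\n").getBytes("UTF-8"));
        } finally {
            fos.close();  // close every file, not only the last one
        }
    }
    bean.close();
    System.out.println("Done " + count + " " + num);
}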
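
The substring step in the loop is easy to misread, so here is a small standalone sketch of it. It assumes that details.toString() yields a string of the form "segment/url"; the class name, the method name, and the sample value are made up for illustration and are not part of Nutch.

```java
public class DetailsKeyDemo {
    // Returns everything after the first "/" of the input.
    // If the input contains no "/", indexOf returns -1 and the whole string comes back.
    public static String stripSegment(String detailsString) {
        return detailsString.substring(detailsString.indexOf("/") + 1);
    }

    public static void main(String[] args) {
        // Hypothetical value; real segment names and URLs will differ.
        String sample = "20130410093000/http://example.com/index.html";
        System.out.println(stripSegment(sample));
    }
}
```

Because only the first "/" is used as the split point, the slashes inside the URL itself are left untouched.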

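The Hit constructor above takes two parameters: an int indexNo and a String uniqueKey. I do not yet know what indexNo is for; perhaps the index I built is not yet large enough for its value to become 1. The second parameter is the document number, which generally counts up from 0. Constructing a Hit gives you the HitDetails, and through NutchBean you can then fetch the corresponding values from the segment.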
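
To make the fields/values layout of HitDetails described earlier more concrete, the following is a minimal sketch of a lookup over two parallel arrays. It is not the Nutch class: ParallelArrayDetails, its getValue method, and the field names "segment" and "url" are assumptions chosen for illustration.

```java
public class ParallelArrayDetails {
    private final String[] fields;  // field names
    private final String[] values;  // values[i] belongs to fields[i]

    public ParallelArrayDetails(String[] fields, String[] values) {
        if (fields.length != values.length) {
            throw new IllegalArgumentException("fields and values must have the same length");
        }
        this.fields = fields;
        this.values = values;
    }

    // Linear scan: returns the value at the position of the first matching field,
    // or null if no field with that name exists.
    public String getValue(String field) {
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(field)) {
                return values[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical contents; a real index decides which fields are stored.
        ParallelArrayDetails details = new ParallelArrayDetails(
            new String[] {"segment", "url"},
            new String[] {"20130410093000", "http://example.com/index.html"});
        System.out.println(details.getValue("segment") + " -> " + details.getValue("url"));
    }
}
```

With only a handful of fields per document, a linear scan over two small arrays is cheap, which may be why this layout can stand in for a map.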