Nutch Secondary Development: Customized Crawling of Website Content

Author: cuikai314

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/cuikai314/article/details/7763265

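This article presents a customized method for extracting web page content: a configuration file specifies URL matching rules and content filters for each website, so that content from specific sites can be extracted effectively.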

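This second article looks at customizing the information that gets crawled. From the earlier analysis, the crawl framework consists mainly of the following steps: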

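1) inject: filters and normalizes the URLs in the hand-written URL file and injects them into the crawldb, stored under the crawldb directory

2) generate: reads URLs from the crawldb, filters and normalizes them, and produces the fetchlist queue, stored in the segment's crawl_generate folder

3) fetch: downloads the page for each URL in the fetchlist and stores the results in the segment's crawl_fetch and content folders

4) parse: parses the pages stored in content and produces the parse_data, parse_text, and crawl_parse folders

5) update: updates the crawldb with the results parsed above

6) invertlinks: computes the inverted links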
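
The phases above and the directories they write can be summarized in a small lookup table. The sketch below is only an illustration: the enum `CrawlPhase` and the class `CrawlPhaseDemo` are hypothetical names that do not exist in Nutch, the segment directory names assume the standard Nutch 1.x layout, and `linkdb` as the output of invertlinks is an assumption that the text does not state.

```java
import java.util.Arrays;
import java.util.List;

class CrawlPhaseDemo {
    public static void main(String[] args) {
        // Print each phase together with the directories it writes.
        for (CrawlPhase phase : CrawlPhase.values()) {
            System.out.println(phase + " -> " + phase.outputs());
        }
    }
}

// Hypothetical summary type; Nutch itself has no such enum.
enum CrawlPhase {
    INJECT("crawldb"),
    GENERATE("crawl_generate"),
    FETCH("crawl_fetch", "content"),
    PARSE("parse_data", "parse_text", "crawl_parse"),
    UPDATE("crawldb"),
    INVERTLINKS("linkdb"); // assumption: not named in the article

    private final List<String> outputs;

    CrawlPhase(String... outputs) {
        this.outputs = Arrays.asList(outputs);
    }

    /** Directories this phase writes (crawldb, linkdb, or a folder inside a segment). */
    List<String> outputs() {
        return outputs;
    }
}
```

Only parse writes parse_text, which is why the customization described in this article hooks into the parse phase.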

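Here I want to build the index from customized content in order to improve search precision. The text that gets indexed is extracted in the parse phase and saved to parse_text; see my previous article, "Nutch Secondary Development: Parsing the Main Content".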

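Because the boilerpipe-based approach does not extract the main text accurately enough, a customized approach has the advantage. To handle different sites and domains efficiently, we record the customization rules in a configuration file. First, here is the configuration file site-contentfilter.properties: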

isCustomized = true

site = 163|qq
regex.163 = http://[a-z0-9]*\\.163\\..*?
regex.qq = http://[a-z0-9]*\\.qq\\..*?

# Backslashes are doubled because java.util.Properties treats \ as an escape character
regex.163.news = http://news\\.163\\.com/\\d{2}/\\d{4}/\\d{2}/.*\\.html
filter.163.news = <p@class="summary">|<div@id="endText">
regex.163.tech = http://tech\\.163\\.com/\\d{2}/\\d{4}/\\d{2}/.*\\.html
filter.163.tech = <div@id="endText">
regex.qq.news = http://news\\.qq\\.com/.*?
filter.qq.news = <div@id="Cnt-Main-Article-QQ">
regex.qq.tech = http://tech\\.qq\\.com/.*?
filter.qq.tech = <div@id="Cnt-Main-Article-QQ">

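Place the configuration file in the conf directory. The isCustomized entry states whether to use the customized approach or the generic boilerpipe approach. The site entry lists the sites to crawl: each URL is first checked against the site-level pattern of its host, which narrows the set of section patterns that still have to be tried and so saves work. In the third block we see, for the news and tech sections of 163 and qq, the URL regular expression and the filter that names the customized tag nodes.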
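
One detail of the file format is easy to miss: `java.util.Properties` treats the backslash as an escape character and silently drops it before characters such as `.`, so a single `\.` in the file reaches the regex engine as a bare `.` that matches any character. The sketch below demonstrates this with an in-memory line instead of the real file. The class `PropertiesEscapeDemo`, the method `loadValue`, and the key `regex.demo` are made up for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Pattern;

class PropertiesEscapeDemo {
    /** Loads one "key = value" line and returns the value stored under regex.demo. */
    static String loadValue(String line) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(line));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("regex.demo");
    }

    public static void main(String[] args) {
        // File text:  regex.demo = http://news\.163\.com/.*
        String single = loadValue("regex.demo = http://news\\.163\\.com/.*");
        // File text:  regex.demo = http://news\\.163\\.com/.*
        String doubled = loadValue("regex.demo = http://news\\\\.163\\\\.com/.*");

        System.out.println(single);  // backslashes dropped: dots act as wildcards
        System.out.println(doubled); // backslashes kept: dots are literal

        // A look-alike URL that has no dots around "163".
        String lookalike = "http://newsX163Ycom/page";
        System.out.println(Pattern.matches(single, lookalike));
        System.out.println(Pattern.matches(doubled, lookalike));
    }
}
```

With single backslashes the patterns still accept the intended URLs, but they are looser than they look; doubling every backslash in the properties file keeps the dots literal.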

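With that in place, we write the utility class that extracts the customized page content, MainContentUtils.java. First, the class (static) member variables and the instance member variables: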

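	// Class-level state, loaded once from the configuration file
	private static InputStream in = null;
	private static Properties p = null;
	private static boolean isCustomized = true;
	private static String[] sites = null; // site names used for filtering
	private static String[] sitesRegex = null; // site-level URL pattern for each site
	private static Parser parser = null; // HTMLParser instance used to parse the page
	
	// Per-page state
	private StringBuffer sb = null;
	private byte[] contentBytes = null;
	private String encoding = null;
	private String url = null; // URL checked against the regular expressions
	private int blockCount = 0; // number of blocks to extract
	private String desFilterString = null; // filter string selected for this URL
	private String desSite = null; // site selected for this URL
	private String desRegex = null; // section regular expression selected for this URL
	private String[] desFilters = null; // selected filter string split into blocks
	private ArrayList<String> regexKeyCandidates = null; // property keys of the candidate section regexes for the site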

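The class member variables handle reading the configuration file and initializing the corresponding properties, while the instance member variables hold the state needed to extract content from each individual input page. Next, the static initializer block: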

	static{
			try {
				in = new BufferedInputStream(
						new FileInputStream("conf/site-contentfilter.properties")); // open the configuration file
				p = new Properties(); // create the Properties object
				p.load(in); // load the configuration file into the Properties object
			} catch (IOException e) {
				e.printStackTrace();
			}
			String isCustomizedString = p.getProperty("isCustomized"); // whether to use the customized approach
			System.out.println("isCustomizedString:" + isCustomizedString);
			try{
				if (isCustomizedString.equals("true")) {
					isCustomized = true;
					System.out.println("isCustomized:" + isCustomized);
					String sitesString = p.getProperty("site"); // site list; returns "163|qq" for the configuration file above
					sites = sitesString.split("\\|");
					String regexKey = null;
					sitesRegex = new String[sites.length]; // site-level URL pattern for each site
					for(int i = 0; i < sites.length; i++) {
						regexKey =  "regex." + sites[i];
						sitesRegex[i] = p.getProperty(regexKey);
					}
				}
				else
					isCustomized = false;
			}catch (Exception e) {
				e.printStackTrace();
			}
	}
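
The loading logic above can also be tried without touching the file system. The following sketch is a simplified, hypothetical variant, not the original class: `SiteConfigSketch` and its members are illustrative names. It reads the same keys from any `Reader`, closes the source with try-with-resources, and treats a missing isCustomized entry as false, which is an assumption rather than the original behavior.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical stand-in for the static state of MainContentUtils. */
class SiteConfigSketch {
    final boolean customized;
    // site name -> site-level URL regex, in the order given by the "site" entry
    final Map<String, String> siteRegex = new LinkedHashMap<>();

    SiteConfigSketch(Reader source) {
        Properties p = new Properties();
        try (Reader in = source) { // the reader is closed even if load() fails
            p.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Assumption: a missing isCustomized entry means "not customized".
        customized = Boolean.parseBoolean(p.getProperty("isCustomized", "false").trim());
        if (customized) {
            for (String site : p.getProperty("site", "").split("\\|")) {
                String name = site.trim();
                String regex = p.getProperty("regex." + name);
                if (!name.isEmpty() && regex != null) {
                    siteRegex.put(name, regex);
                }
            }
        }
    }

    static SiteConfigSketch fromString(String text) {
        return new SiteConfigSketch(new StringReader(text));
    }

    public static void main(String[] args) {
        SiteConfigSketch config = fromString(
                "isCustomized = true\n"
                + "site = 163|qq\n"
                + "regex.163 = http://[a-z0-9]*\\\\.163\\\\..*?\n"
                + "regex.qq = http://[a-z0-9]*\\\\.qq\\\\..*?\n");
        System.out.println(config.customized + " " + config.siteRegex);
    }
}
```

Keeping the configuration in an object instead of static fields also makes it possible to load different rule sets in tests.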

The constructor is as follows:

	public MainContentUtils(StringBuffer sb, String url, byte[] content, String encoding) {
		this.sb = sb; // StringBuffer that receives the extracted main-content text
		this.url = url; // URL whose content is to be extracted
		this.contentBytes = content; // raw page source for this URL, as bytes
		this.encoding = encoding; // character encoding of the page source
	}

Next is the main interface for obtaining the main-content text:

	/**
	 * Main entry point for extracting the main content.
	 * Uses the url, page bytes, and encoding passed to the constructor and
	 * appends the extracted text to the StringBuffer passed to the constructor.
	 * @return true if text was extracted successfully
	 */
	public boolean getMainContent(){
		if(isCustomized) {
			try {
				parser = new Parser(new String(contentBytes, encoding)); // create the parser from the decoded page source
				parser.setEncoding(encoding);
			} catch (ParserException e) {
				e.printStackTrace();
			}
			catch (IOException e) {
				e.printStackTrace();
			}
			
			if(!analyzeURL(url)){ // should this URL be parsed? Index pages are filtered out and return false
				return false;
			}
			analyzeFilterString(); // split the customized tag-node spec: one block, or two (a summary block and a body block)
			for(int i = 0 ; i < blockCount; i++) {
				if (!getMainContentBlock(sb, i)) // extract the text of each tag-node block
					return false;
			}
			return true;
		}else {
			try {
				// Note: analyzeURL reads sites and sitesRegex, which the static
				// initializer fills only when isCustomized is true
				if(!analyzeURL(url)){
					return false;
				}
				sb.append(BoilerpipeUtils.getMainbodyTextByBoilerpipe( // extract the content with boilerpipe
						new InputSource(new ByteArrayInputStream(contentBytes))));
				return true;
			} catch (BoilerpipeProcessingException e) {
				e.printStackTrace();
			} catch (SAXException e) {
				e.printStackTrace();
			}
			
		}
		return false;
	}

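From the code above, the flow is: first check the URL against the rules, which filters out index pages, then extract the content block by block. The code for each step follows: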

	private void analyzeFilterString() {
		// Split the selected filter string on | into one entry per block
		desFilters = desFilterString.split("\\|");
		blockCount = desFilters.length;
	}

	/***
	 * Finds the filter that applies to the given URL.
	 * @param url the input URL
	 * @return true if a site and a section of that site match the URL
	 */
	private boolean analyzeURL(String url) {
		this.url = url;
		Pattern pattern = null;
		for(int i = 0; i < sites.length; i++) {
			pattern = Pattern.compile(sitesRegex[i]);
			Matcher matcher = pattern.matcher(url);
			if (matcher.matches()) {
				desSite = sites[i];
				getRegexBySite(desSite);
				return getFilterByUrl(url);
			}
		}
		return false; // no configured site matches
	}

	/***
	 * Looks for a section of the selected site whose regex matches the URL.
	 * @param url the input URL
	 * @return true if a section filter was found for this site
	 */
	private boolean getFilterByUrl(String url) {
		String regex = null, key = null;
		Iterator<String> it = regexKeyCandidates.iterator();
		while(it.hasNext()) {
			key = it.next();
			regex = p.getProperty(key);
			Pattern pattern = Pattern.compile(regex);
			Matcher matcher = pattern.matcher(url);
			if(matcher.matches()) {
				desRegex = regex;
				// regex.163.news -> filter.163.news
				key = "filter." + key.substring(key.indexOf(".")+1);
				desFilterString = p.getProperty(key);
				return true; // matched this site and this section
			}
		}
		return false; // the site has no matching section
	}

	/***
	 * Collects the property keys of the candidate section regexes for a site.
	 * @param site the site that the URL belongs to
	 */
	private void getRegexBySite(String site){
		regexKeyCandidates = new ArrayList<String>();
		Enumeration<Object> e = p.keys();
		String key;
		while(e.hasMoreElements()){
			key = (String) e.nextElement();
			// The trailing dot keeps only section keys such as regex.163.news and
			// skips both the site-level key regex.163 and sites that merely share a prefix
			if (key.startsWith("regex." + site + ".")) {
				regexKeyCandidates.add(key);
			} 
		}
		
	}
	/***
	 * Extracts one content block from the page.
	 * @param sb receives the extracted text
	 * @param index index of the block to extract
	 * @return true if the block was extracted
	 */
	private boolean getMainContentBlock(StringBuffer sb, int index) {
	    // A block looks like <div@id="endText">: tag name, attribute name, attribute value
	    String filter = desFilters[index].trim();
	    filter = filter.substring(filter.indexOf("<")+1, filter.indexOf(">"));
	    String[]  s = filter.split("@");
	    String nodeTag = s[0];
	    String attribute = s[1]; 
	    NodeFilter nodeFilter = null;
	    NodeList nodelist = null;
	    String attributeKey = attribute.substring(0,attribute.indexOf("="));
	    String attributeValue = attribute.substring(
				attribute.indexOf("\"")+1, attribute.lastIndexOf("\""));
	    if(nodeTag.equals("p") || nodeTag.equals("div")) {
	    		// match nodes with this tag name AND this attribute value
	    		nodeFilter = new AndFilter(
	    				new TagNameFilter(nodeTag),
	    				new HasAttributeFilter(attributeKey, attributeValue));
	    }else {
	    	// other tags are left for future extension; a null filter would match every node
	    	return false;
	    }
	    try {
			parser.reset(); // rewind so that a second block can be parsed from the same page
			nodelist = parser.parse(nodeFilter);
			if(nodelist.size() != 0){
				for(int i = 0; i <nodelist.size(); i++) {
					if(nodeTag.equals("div")) {
						Div divNode = (Div) nodelist.elementAt(i);
						sb.append("DivTag" + divNode.toPlainTextString());
					}else if(nodeTag.equals("p")) {
						Node pNode = nodelist.elementAt(i);
						sb.append("PTag" + pNode.toPlainTextString());
					}
				}
				return true;
			}
		} catch (ParserException e) {
			e.printStackTrace();
		}
	    return false;
	}
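
The two-stage URL lookup and the splitting of a filter string such as `<p@class="summary">|<div@id="endText">` can be tried in isolation, without Nutch or HTMLParser on the classpath. The sketch below is a hypothetical, self-contained rewrite of just those two pieces: the class `ContentRuleSketch`, its nested `NodeSpec` type, and all method names are illustrative and not part of the original code. Unlike the methods above, it compiles each pattern once and rejects malformed filter blocks explicitly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of Nutch or of MainContentUtils. */
class ContentRuleSketch {
    /** One parsed block such as <div@id="endText">. */
    static final class NodeSpec {
        final String tag, attrName, attrValue;
        NodeSpec(String tag, String attrName, String attrValue) {
            this.tag = tag;
            this.attrName = attrName;
            this.attrValue = attrValue;
        }
    }

    private final Map<String, Pattern> sitePatterns = new LinkedHashMap<>();    // "163" -> site pattern
    private final Map<String, Pattern> sectionPatterns = new LinkedHashMap<>(); // "163.news" -> section pattern
    private final Map<String, String> filters = new LinkedHashMap<>();          // "163.news" -> filter string

    void addSite(String site, String regex) {
        sitePatterns.put(site, Pattern.compile(regex)); // compile once, reuse for every URL
    }

    void addSection(String site, String section, String regex, String filter) {
        String key = site + "." + section;
        sectionPatterns.put(key, Pattern.compile(regex));
        filters.put(key, filter);
    }

    /** Stage 1 matches the site, stage 2 a section of that site; null if either stage fails. */
    String findFilter(String url) {
        for (Map.Entry<String, Pattern> site : sitePatterns.entrySet()) {
            if (!site.getValue().matcher(url).matches()) {
                continue;
            }
            String prefix = site.getKey() + ".";
            for (Map.Entry<String, Pattern> section : sectionPatterns.entrySet()) {
                if (section.getKey().startsWith(prefix)
                        && section.getValue().matcher(url).matches()) {
                    return filters.get(section.getKey());
                }
            }
            return null; // known site, but no section rule (for example an index page)
        }
        return null; // unknown site
    }

    /** Splits "a|b" into blocks and parses each <tag@name="value"> block. */
    static List<NodeSpec> parseFilter(String filter) {
        List<NodeSpec> specs = new ArrayList<>();
        for (String raw : filter.split("\\|")) {
            String s = raw.trim();
            int at = s.indexOf('@');
            int eq = s.indexOf('=', at + 1);
            int q1 = s.indexOf('"');
            int q2 = s.lastIndexOf('"');
            boolean ok = s.startsWith("<") && s.endsWith(">")
                    && at > 1 && eq > at + 1 && q1 == eq + 1 && q2 > q1;
            if (!ok) {
                throw new IllegalArgumentException("Malformed filter block: " + raw);
            }
            specs.add(new NodeSpec(
                    s.substring(1, at), s.substring(at + 1, eq), s.substring(q1 + 1, q2)));
        }
        return specs;
    }

    public static void main(String[] args) {
        ContentRuleSketch rules = new ContentRuleSketch();
        rules.addSite("163", "http://[a-z0-9]*\\.163\\..*?");
        rules.addSection("163", "news",
                "http://news\\.163\\.com/\\d{2}/\\d{4}/\\d{2}/.*\\.html",
                "<p@class=\"summary\">|<div@id=\"endText\">");
        String filter = rules.findFilter("http://news.163.com/12/0718/10/example.html");
        for (NodeSpec spec : parseFilter(filter)) {
            System.out.println(spec.tag + " " + spec.attrName + "=" + spec.attrValue);
        }
    }
}
```

Because the filter string is split on `|`, this format cannot express an attribute value that itself contains `|`; that limitation is shared with the original code.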

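That completes the walkthrough of the utility class. In the HtmlParser class of the parse-html plugin, add the following code at the point where text is obtained: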

      MainContentUtils mct = new MainContentUtils(
          sb, content.getUrl(), contentInOctets, encoding);
      mct.getMainContent();
      text = sb.toString();
      FileWriter fw;
      try {
        fw = new FileWriter("E://mainbodypage//URLText.txt", true); // append mode; for testing only
        fw.write("url::" + content.getUrl() + "\n");
        fw.write("text::" + text + "\n");
        fw.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
      sb.setLength(0); // clear the buffer for the next page
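
The test log above relies on the platform default encoding and leaves the writer open if a write fails. A possible alternative, sketched here under the assumption that UTF-8 output is wanted, uses try-with-resources and an explicit charset. The class `ExtractionLogSketch` and its methods are hypothetical names and not part of the plugin; `roundTrip` exists only to demonstrate the helper on a temporary file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper for test logging; not part of the original plugin code. */
class ExtractionLogSketch {
    /** Appends one url/text record to logFile as UTF-8, creating the file if needed. */
    static void append(Path logFile, String url, String text) {
        try (BufferedWriter w = Files.newBufferedWriter(logFile, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            w.write("url::" + url + "\n");
            w.write("text::" + text + "\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one record to a temporary file and returns the file's content. */
    static String roundTrip(String url, String text) {
        try {
            Path tmp = Files.createTempFile("URLText", ".txt");
            try {
                append(tmp, url, text);
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(roundTrip("http://news.163.com/example.html", "extracted text"));
    }
}
```

In the plugin, the call would take the place of the FileWriter block, for example `ExtractionLogSketch.append(Paths.get("E:/mainbodypage/URLText.txt"), content.getUrl(), text)`.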

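With the content-extraction utility finished, the next step is to control the URL queue so that all URLs get crawled, and after that to consider how the content and the index are updated.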