Configuring Nutch to Simulate a Browser and Bypass Anti-Crawler Restrictions

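This article explains in detail how to configure Nutch to simulate a browser and get past a website's simple anti-crawler policy. The key is setting the User-Agent so that Nutch presents itself as a browser such as Firefox, IE, or Chrome.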

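When we configure Nutch to crawl http://yangshangchuan.iteye.com, the content of every fetched page is: "Your access request was denied ...". This is the simplest kind of anti-crawler policy: it just reads the value of the User-Agent HTTP request header to decide whether the client is a person (a browser) or a crawler. To get around this restriction, we only need to configure Nutch to simulate a web browser.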
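
To make the idea concrete, here is a minimal sketch of what such a server-side check might look like. The site's actual rule is not known; the class name `NaiveAgentFilter`, the marker list, and the prefix test are all assumptions made for illustration only.

```java
import java.util.Locale;

final class NaiveAgentFilter {
    // Hypothetical marker list: the real site's rule is unknown.
    private static final String[] BROWSER_PREFIXES = {"mozilla/", "opera/"};

    /** Returns true if the User-Agent value starts like a typical browser string. */
    static boolean looksLikeBrowser(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return false; // a missing header is treated as a crawler
        }
        String ua = userAgent.trim().toLowerCase(Locale.ROOT);
        for (String prefix : BROWSER_PREFIXES) {
            if (ua.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A crawler-style User-Agent is rejected by this sketch.
        System.out.println(looksLikeBrowser("MyCrawler/Nutch-1.7"));
        // A Firefox-style User-Agent is accepted.
        System.out.println(looksLikeBrowser(
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```

A check of this kind only looks at one request header, which is why changing the header value is enough to pass it.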

nutch-default.xml contains five properties related to the User-Agent:
    <property>  
      <name>http.agent.description</name>  
      <value></value>  
      <description>Further description of our bot- this text is used in  
      the User-Agent header.  It appears in parenthesis after the agent name.  
      </description>  
    </property>  
    <property>  
      <name>http.agent.url</name>  
      <value></value>  
      <description>A URL to advertise in the User-Agent header.  This will   
       appear in parenthesis after the agent name. Custom dictates that this  
       should be a URL of a page explaining the purpose and behavior of this  
       crawler.  
      </description>  
    </property>  
    <property>  
      <name>http.agent.email</name>  
      <value></value>  
      <description>An email address to advertise in the HTTP 'From' request  
       header and User-Agent header. A good practice is to mangle this  
       address (e.g. 'info at example dot com') to avoid spamming.  
      </description>  
    </property>  
    <property>  
      <name>http.agent.name</name>  
      <value></value>  
      <description>HTTP 'User-Agent' request header. MUST NOT be empty -   
      please set this to a single word uniquely related to your organization.  
      NOTE: You should also check other related properties:  
        http.robots.agents  
        http.agent.description  
        http.agent.url  
        http.agent.email  
        http.agent.version  
      and set their values appropriately.  
      </description>  
    </property>  
    <property>  
      <name>http.agent.version</name>  
      <value>Nutch-1.7</value>  
      <description>A version string to advertise in the User-Agent   
       header.</description>  
    </property>

The class nutch1.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java shows how these five properties are combined into the User-Agent:
    this.userAgent = getAgentString( conf.get("http.agent.name"),   
            conf.get("http.agent.version"),   
            conf.get("http.agent.description"),   
            conf.get("http.agent.url"),   
            conf.get("http.agent.email") ); 

    private static String getAgentString(String agentName,  
                                         String agentVersion,  
                                         String agentDesc,  
                                         String agentURL,  
                                         String agentEmail) {  
        
      if ( (agentName == null) || (agentName.trim().length() == 0) ) {  
        // TODO : NUTCH-258  
        if (LOGGER.isErrorEnabled()) {  
          LOGGER.error("No User-Agent string set (http.agent.name)!");  
        }  
      }  
        
      StringBuffer buf= new StringBuffer();        
      buf.append(agentName);  
      if (agentVersion != null) {  
        buf.append("/");  
        buf.append(agentVersion);  
      }  
      if ( ((agentDesc != null) && (agentDesc.length() != 0))  
      || ((agentEmail != null) && (agentEmail.length() != 0))  
      || ((agentURL != null) && (agentURL.length() != 0)) ) {  
        buf.append(" (");  
          
        if ((agentDesc != null) && (agentDesc.length() != 0)) {  
          buf.append(agentDesc);  
          if ( (agentURL != null) || (agentEmail != null) )  
            buf.append("; ");  
        }  
          
        if ((agentURL != null) && (agentURL.length() != 0)) {  
          buf.append(agentURL);  
          if (agentEmail != null)  
            buf.append("; ");  
        }  
          
        if ((agentEmail != null) && (agentEmail.length() != 0))  
          buf.append(agentEmail);  
          
        buf.append(")");  
      }  
      return buf.toString();  
    }  
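
As a worked example of the format this method produces, the sketch below restates the same logic in a standalone class so it can be run outside Nutch (logging is omitted). The class name `AgentStringDemo` and the crawler identity passed in `main` are hypothetical values chosen for illustration.

```java
final class AgentStringDemo {
    // Standalone restatement of the quoted getAgentString logic, without logging.
    static String agentString(String name, String version,
                              String desc, String url, String email) {
        StringBuilder buf = new StringBuilder();
        buf.append(name);
        if (version != null) {
            buf.append("/").append(version);
        }
        boolean hasDesc = desc != null && !desc.isEmpty();
        boolean hasUrl = url != null && !url.isEmpty();
        boolean hasEmail = email != null && !email.isEmpty();
        // The parenthesised part is only emitted if at least one field is set.
        if (hasDesc || hasUrl || hasEmail) {
            buf.append(" (");
            if (hasDesc) {
                buf.append(desc);
                if (url != null || email != null) {
                    buf.append("; ");
                }
            }
            if (hasUrl) {
                buf.append(url);
                if (email != null) {
                    buf.append("; ");
                }
            }
            if (hasEmail) {
                buf.append(email);
            }
            buf.append(")");
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Hypothetical crawler identity with all five fields set.
        System.out.println(agentString("MyCrawler", "Nutch-1.7", "test crawler",
                "http://example.com/bot", "bot at example dot com"));
        // prints: MyCrawler/Nutch-1.7 (test crawler; http://example.com/bot; bot at example dot com)

        // With description, URL and email left empty, only name/version remains.
        System.out.println(agentString("MyCrawler", "Nutch-1.7", "", "", ""));
        // prints: MyCrawler/Nutch-1.7
    }
}
```

The second call is the case that matters for browser simulation: when the three optional fields are empty, the User-Agent is exactly the name, a "/", and the version.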

The class nutch1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java writes the User-Agent request header. The userAgent returned by http.getUserAgent() here is the same userAgent built in HttpBase.java:

    String userAgent = http.getUserAgent();  
    if ((userAgent == null) || (userAgent.length() == 0)) {  
        if (Http.LOG.isErrorEnabled()) { Http.LOG.error("User-agent is not set!"); }  
    } else {  
        reqStr.append("User-Agent: ");  
        reqStr.append(userAgent);  
        reqStr.append("\r\n");  
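    }

The analysis above shows that the User-Agent is http.agent.name, followed by "/" and http.agent.version, as long as http.agent.description, http.agent.url and http.agent.email are left empty. To imitate a specific browser, we therefore split that browser's User-Agent string at a "/" and put the two halves into these two properties. Adding any one of the following configurations to nutch-site.xml is enough: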
 
1. Simulating Firefox:
    <property>  
        <name>http.agent.name</name>  
        <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>  
    </property>  
    <property>  
        <name>http.agent.version</name>  
        <value>20100101 Firefox/27.0</value>  
    </property>  

2. Simulating IE:
    <property>  
        <name>http.agent.name</name>  
        <value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>  
    </property>  
    <property>  
        <name>http.agent.version</name>  
        <value>6.0)</value>  
    </property>  

3. Simulating Chrome:
    <property>  
        <name>http.agent.name</name>  
        <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>  
    </property>  
    <property>  
        <name>http.agent.version</name>  
        <value>537.36</value>  
    </property>  

4. Simulating Safari:
    <property>  
        <name>http.agent.name</name>  
        <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>  
    </property>  
    <property>  
        <name>http.agent.version</name>  
        <value>534.57.2</value>  
    </property>  

5. Simulating Opera:
    <property>  
        <name>http.agent.name</name>  
        <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>  
    </property>  
    <property>  
        <name>http.agent.version</name>  
        <value>19.0.1326.59</value>  
    </property>  
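
Before starting a crawl, it can be useful to preview the User-Agent that a given pair of properties would yield. The following is a rough sketch using only the JDK's XML parser, not Nutch itself. The class `SiteConfigPreview` and its methods are hypothetical, and the sketch assumes that http.agent.description, http.agent.url and http.agent.email are empty. It also does not merge in the defaults from nutch-default.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class SiteConfigPreview {
    /** Reads the name/value pairs of all property elements in the given XML. */
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new HashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                // Trimming is a convenience of this sketch.
                String name = property.getElementsByTagName("name").item(0)
                        .getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0)
                        .getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    /** Joins name and version with "/", assuming the other three fields are empty. */
    static String previewUserAgent(Map<String, String> props) {
        String name = props.get("http.agent.name");
        String version = props.get("http.agent.version");
        return version == null ? name : name + "/" + version;
    }

    public static void main(String[] args) {
        // The Firefox configuration from this article, wrapped in a root element.
        String xml = "<configuration>"
                + "<property><name>http.agent.name</name>"
                + "<value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value></property>"
                + "<property><name>http.agent.version</name>"
                + "<value>20100101 Firefox/27.0</value></property>"
                + "</configuration>";
        System.out.println(previewUserAgent(readProperties(xml)));
        // prints: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0
    }
}
```

Running the same preview on the IE configuration gives `Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)`, which shows why the closing parenthesis is placed in http.agent.version.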

Postscript: websites where you can check your browser's User-Agent:
1. http://www.useragentstring.com
2. http://whatsmyuseragent.com
3. http://www.enhanceie.com/ua.aspx

Reposted from: https://my.oschina.net/junfrank/blog/288048
