Process Blocking Caused by HttpClient While a Crawler Fetches Pages

struggleee_luo

License: CC 4.0 BY-SA

Category: Java language learning. Tags: crawler, threads, HttpClient, socket programming

Article link: https://blog.youkuaiyun.com/u010695420/article/details/53898526

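This article shares how I resolved a thread-blocking problem that a crawler hit on one particular site. The analysis traced the blocking to the readLine() function and to missing timeouts, and the problem was finally solved by switching to HttpClient with timeouts configured.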

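I am currently working on a crawler project that fetches book detail pages from several book sites, and I ran into a nasty problem: every other site finishes crawling quickly, but the thread for site A alone hangs, always blocking on some random page. I traced the fault to the download function. This is the original version: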

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;

public String staticDownload(String urlstr, String encoding, String param) throws Exception {
  StringBuilder buffer = new StringBuilder();
  PrintWriter out = null;
  BufferedReader in = null;
  try {
    URL url = new URL(urlstr);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("POST");
    connection.setDoOutput(true);
    connection.setDoInput(true);
    connection.setConnectTimeout(5000);
    connection.setReadTimeout(5000);
    connection.setRequestProperty("accept", "*/*");
    connection.setRequestProperty("connection", "Keep-Alive");
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 "
        + "(Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.14) "
        + "Gecko/20080404 Firefox/2.0.0.14");
    out = new PrintWriter(connection.getOutputStream());
    // Send the request parameters.
    out.print(param);
    // Flush the output stream's buffer.
    out.flush();
    in = new BufferedReader(new InputStreamReader(connection.getInputStream(), encoding));
    String line;
    while ((line = in.readLine()) != null) {
      buffer.append(line);
      buffer.append("\r\n");
    }
  } catch (Exception e) {
    // Report the failure instead of silently swallowing it.
    e.printStackTrace();
  } finally {
    if (out != null) {
      out.close();
    }
    try {
      if (in != null) {
        in.close();
      }
    } catch (IOException ex) {
      ex.printStackTrace();
    }
  }
  return buffer.toString();
}

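A short digression: there are many ways to download a page. For a crawler it is best to simulate browser behavior, using WebClient. Some pages, however, require user interaction, for example a site's search page, where a search term has to be submitted to get the content to crawl. As we know, there are two ways to send a request to a given site: GET and POST.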
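To make the GET/POST distinction concrete, here is a minimal sketch using only the JDK. The class name FormEncodingSketch, the parameter names, and the example.com URL are hypothetical and chosen purely for illustration; the point is that the same URL-encoded parameter text goes into the query string for GET and into the request body for POST.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormEncodingSketch {
    /** Encodes name/value pairs as application/x-www-form-urlencoded text. */
    static String encodeForm(Map<String, String> params) throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical search parameters.
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("keyword", "java crawler");
        params.put("page", "1");
        String encoded = encodeForm(params);
        // GET: the parameters travel in the URL's query string.
        System.out.println("GET  http://example.com/search?" + encoded);
        // POST: the same text is written to the request body instead.
        System.out.println("POST body: " + encoded);
    }
}
```

The download function above takes the POST route: it writes param to the connection's output stream.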

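URLConnection and HttpClient both access network resources over HTTP, and either can issue these requests. Links comparing the two are listed in the references; I won't analyze the differences here. Clearly the function above uses the former.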

From what I found, readLine() is a blocking function: when there is no data to read, it simply waits there:

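1. It returns null only when the end of the stream is reached, for example after the server closes the connection; an I/O error surfaces as an IOException rather than a null return.
2. If no buffer size is specified, readLine() uses a buffer of 8192 characters. Before the end of the stream it returns a line only when it meets "\r", "\n", or "\r\n"; filling the buffer does not by itself make it return.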
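The two points above can be checked without any network access. The following is a small sketch (the class name ReadLineSketch is made up for this example) that feeds readLine() an in-memory string containing all three terminators plus a final line with no terminator at all.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class ReadLineSketch {
    /** Collects every line readLine() yields until it returns null. */
    static List<String> readLines(String text) throws IOException {
        List<String> lines = new ArrayList<String>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String line;
            // readLine() strips the terminator and returns null at end of stream.
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // "\r", "\n" and "\r\n" each end a line; "d" is ended by end of stream.
        System.out.println(readLines("a\rb\nc\r\nd")); // prints [a, b, c, d]
    }
}
```

With a StringReader the end of the stream always arrives, so the last line is returned. On a socket whose peer keeps the connection open, that end never arrives, which is the situation discussed next.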

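We cannot know whether the server returns any content at all: an empty response on a connection that stays open blocks too. If there is content, we cannot know whether each line contains one of the three terminators above; if it does not, the call blocks and the while loop never exits. With the real problem apparently found, the only option was to replace readLine(); the sources I read also recommend avoiding readLine() on socket streams.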
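One common replacement for readLine() is to read into a character buffer, which does not depend on line terminators and keeps them intact in the result. This is only a sketch under that assumption (the names ChunkedReadSketch and readAll are hypothetical), and it is not a cure for a stalled server: read() still blocks when no data arrives and the stream is not closed, so a timeout is needed either way.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ChunkedReadSketch {
    /** Reads everything from the reader using a fixed-size character buffer. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        // read() returns once some characters are available, or -1 at end of stream.
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Content without any line terminator is read without special handling.
        System.out.println(readAll(new StringReader("no terminator here")));
    }
}
```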

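Since URLConnection wasn't working, I switched to HttpClient, which is more powerful and doesn't need readLine(). It was a desperate move, admittedly. The problem sits in the function that fetches pages via POST, where param is the value passed in. Running the crawler again, the problem was traced to: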

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public String staticDownloadByHttpClient(String urlstr, String encoding, boolean bFrame, String param) throws IOException {
  String bufferStr = null;
  // Create the HttpPost request.
  HttpPost httppost = new HttpPost(urlstr);
  // Build the parameter list; param is expected to look like "name=value".
  String[] pair = param.split("=", 2);
  List<NameValuePair> formparams = new ArrayList<NameValuePair>();
  formparams.add(new BasicNameValuePair(pair[0], pair.length > 1 ? pair[1] : ""));
  // Create the default HttpClient instance; try-with-resources closes it and releases resources.
  try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
    httppost.setEntity(new UrlEncodedFormEntity(formparams, "UTF-8"));
    try (CloseableHttpResponse response = httpclient.execute(httppost)) {
      HttpEntity entity = response.getEntity();
      if (entity != null) {
        // The first version read one character from entity.getContent() here to
        // check for an empty response. That check was of little use and also
        // consumed the start of the body, so it has been dropped.
        // The problem shows up at the following statement.
        bufferStr = EntityUtils.toString(entity, encoding);
      }
      EntityUtils.consumeQuietly(entity);
    }
  } catch (IOException e) {
    // Also covers ClientProtocolException and UnsupportedEncodingException.
    e.printStackTrace();
  }
  return bufferStr;
}

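With no other option, I looked at the source of the toString function. I made small changes to this code, but it is basically as follows: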

private String toString(final HttpEntity entity, final Charset defaultCharset) throws IOException, ParseException {
      Args.notNull(entity, "Entity");
      final InputStream instream = entity.getContent();
      if (instream == null) {
          return null;
      }
      try {
          Args.check(entity.getContentLength() <= Integer.MAX_VALUE,
                  "HTTP entity too large to be buffered in memory");
          int i = (int)entity.getContentLength();
          if (i < 0) {
              i = 4096;
          }
          Charset charset = null;
          try {
              final ContentType contentType = ContentType.get(entity);
              if (contentType != null) {
                  charset = contentType.getCharset();
              }
          } catch (final UnsupportedCharsetException ex) {
              throw new UnsupportedEncodingException(ex.getMessage());
          }
          if (charset == null) {
              charset = defaultCharset;
          }
          if (charset == null) {
              charset = HTTP.DEF_CONTENT_CHARSET;
          }
          final Reader reader = new InputStreamReader(instream, charset);
          final CharArrayBuffer buffer = new CharArrayBuffer(i);
          final char[] tmp = new char[1024];
          int l;
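          long dis = System.currentTimeMillis();
          // The problem is still here.
          // Note: ready() is false whenever no data has arrived yet, so this
          // condition can also end the loop before the whole body has been read.
          while (reader.ready() && (l = reader.read(tmp)) != -1) {
              buffer.append(tmp, 0, l);
              long now = System.currentTimeMillis();
              // Give up once reading has taken more than five minutes.
              if (now - dis > 5 * 60 * 1000) {
                logUtil.getLogger().error(String.format("MSG: the content that site return is too large to be buffered in memory, timed out after %s ms", now - dis));
                break;
              }
          }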
          return buffer.toString();
      } finally {
          instream.close();
      }
  }
  private String toString(final HttpEntity entity, final String defaultCharset) throws IOException, ParseException {
      return toString(entity, defaultCharset != null ? Charset.forName(defaultCharset) : null);
  }

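Tracing further, it was, unsurprisingly, the while loop hanging again. This code clearly reads plain characters in chunks, so readLine() cannot be to blame. In despair I searched Baidu for "HttpClient post timeout handling" and found a very short post by an expert, which included this sentence:
BTW, with version 4.3, if no timeout is set and the server does not respond, the wait goes on for a very long time (>24 hours).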

I checked the version of my HttpClient jar and had a strong feeling the problem was about to be solved, so I added timeout settings right away:

// Set the socket (data transfer) timeout and the connect timeout, both in milliseconds.
RequestConfig requestConfig = RequestConfig.custom()
    .setSocketTimeout(6000)
    .setConnectTimeout(6000)
    .build();
httppost.setConfig(requestConfig);

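I have now run the test more than 8 times and the thread has not hung again. Thinking about it afterwards, shouldn't anything involving socket programming have timeout handling? Come, read the following passage with me: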

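We know that a Socket blocks while reading data: if no data arrives, the program stays blocked there. For a synchronous request this clearly cannot be allowed, so the blocking has to be interrupted after a certain amount of time so that the program can carry on. Socket provides a setSoTimeout() method for setting the timeout for receiving data, in milliseconds. When the timeout is greater than 0 and that much time passes without the Socket receiving any data, the Socket throws a SocketTimeoutException.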
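The behavior described in that passage can be reproduced locally. The sketch below is an illustration only: the class name SoTimeoutSketch is hypothetical, and the "server" is just a local ServerSocket that accepts the connection and then never sends anything, standing in for an unresponsive site.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SoTimeoutSketch {
    /** Reads from a silent local server and reports how the read ended. */
    static String readWithTimeout(int timeoutMillis) throws IOException {
        InetAddress local = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, local);
             Socket client = new Socket(local, server.getLocalPort());
             Socket accepted = server.accept()) {
            // The accepted side stays open but never writes a byte.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                int b = in.read(); // would block indefinitely without the timeout
                return "read returned " + b;
            } catch (SocketTimeoutException e) {
                return "timed out";
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readWithTimeout(500)); // prints "timed out"
    }
}
```

With setSoTimeout(0), which is the default and means no timeout, the same read() would wait for as long as the peer keeps the connection open without sending data.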

References:
http://blog.youkuaiyun.com/hguang_zjh/article/details/33743249
http://blog.youkuaiyun.com/wuhong_csdn/article/details/50830349
http://witcheryne.iteye.com/blog/1135817
http://www.yiibai.com/java/io/bufferedreader_ready.html
https://zhidao.baidu.com/question/330258186.html
https://my.oschina.net/u/577453/blog/173724
http://elim.iteye.com/blog/1979837