Android: Several Ways to Fetch Web Page Content

License: CC 4.0 BY-SA

Tags: #crawler, Android, web scraping

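This article describes how to fetch web page content on Android with URLConnection: an HttpURLConnection object, combined with an InputStreamReader, turns the page content into a string for further processing.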

These notes record a few ways to fetch web page data: given the URL of a page, retrieve its content as a String or as a Document object.

Method 1: URLConnection. Calling openConnection() on a URL object returns an HttpURLConnection. An InputStreamReader then converts the whole page content into a String.

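// Needs java.net.URL, java.net.HttpURLConnection, java.io.InputStreamReader, android.util.Log.
// Run this off the main thread; Android throws NetworkOnMainThreadException otherwise.
URL url = new URL(Url);
HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
try {
    if (httpConn.getResponseCode() == HttpURLConnection.HTTP_OK) {
        Log.d("TAG", "---into-----urlConnection---success--");

        // Decode the body as UTF-8. A StringBuilder with a char buffer avoids
        // the quadratic cost of appending one char at a time to a String.
        InputStreamReader isr = new InputStreamReader(httpConn.getInputStream(), "utf-8");
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = isr.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        isr.close();
        String content = sb.toString();
    } else {
        Log.d("TAG", "---into-----urlConnection---fail--");
    }
} finally {
    // Release the connection on both the success and the failure path
    httpConn.disconnect();
}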
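
The snippet above hardcodes "utf-8", which garbles pages served in another encoding. As a rough sketch only, the reading step could be factored into a helper that takes the charset from the Content-Type header. The class name PageReader and the methods charsetFrom and readAll are hypothetical names introduced for illustration, and falling back to UTF-8 is an assumption, not something the original code does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PageReader {

    // Hypothetical helper: pick the charset out of a Content-Type header value
    // such as "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing or names a charset this JVM does not support.
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // Case-insensitive check for the "charset=" prefix
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the fallback below
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Hypothetical helper: read the whole stream into a String using a char buffer.
    // The reader (and the underlying stream) is closed when the method returns.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for httpConn.getInputStream(), so the sketch runs without a network
        byte[] body = "<html>hi</html>".getBytes(StandardCharsets.UTF_8);
        Charset cs = charsetFrom("text/html; charset=utf-8");
        System.out.println(readAll(new ByteArrayInputStream(body), cs));
    }
}
```

With a real connection, the call would look like PageReader.readAll(httpConn.getInputStream(), PageReader.charsetFrom(httpConn.getContentType())). Pages that declare their encoding only in an HTML meta tag are not handled by this sketch.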

Method 2: HttpClient. Wrap the URL in an HttpGet object, then call HttpClient's execute method to obtain either an HttpResponse object or the page content directly as a String.

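// Note: the Apache HttpClient classes (org.apache.http.*) were deprecated in API 22
// and removed from the Android SDK in API 23; newer projects need the legacy library.
DefaultHttpClient httpClient = new DefaultHttpClient();
HttpGet httpGet = new HttpGet(Url);
// BasicResponseHandler returns the body as a String and throws for status codes >= 300
ResponseHandler<String> responseHandler = new BasicResponseHandler();
try {
    String content = httpClient.execute(httpGet, responseHandler);
    // Alternative: get the full response object instead of just the body
    // HttpResponse resp = httpClient.execute(httpGet);
} catch (ClientProtocolException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}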

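Method 3: Jsoup. Here the page is returned as a Document object, and the second argument is the timeout in milliseconds: Document doc = Jsoup.parse(new URL("http://www.baidu.com"), 5000);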