Web Crawler Basics

金玉良缘2017

License: CC 4.0 BY-SA

Categories: crawler, java. Tags: html, web crawler, jsoup, java crawler

Article link: https://blog.youkuaiyun.com/qq_24708791/article/details/78289882

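1. Crawler concepts

What is a crawler

A crawler, also called a web crawler, is an automated program that runs on the internet in order to collect data.

A simple classification of crawlers

General-purpose crawlers, such as Baidu's, which crawl the whole internet.

Vertical crawlers, which exist to collect data for analysis. For example:

A Taobao review crawler

A Taobao product crawler

Classification criterion: the volume of data or the business scope covered.

Most crawlers on the internet are vertical crawlers, meaning they only crawl data within a limited scope.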
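
To make "a limited scope" concrete, here is a minimal sketch of how a vertical crawler might decide whether a URL belongs to its scope. It assumes the scope is defined by a single host name and its subdomains; the class name `ScopeFilter` and the method `inScope` are illustrative names, not part of any library.

```java
import java.net.URI;

public class ScopeFilter {

    // Hypothetical helper: returns true when the URL's host equals allowedHost
    // or is a subdomain of it. Malformed URLs are treated as out of scope.
    public static boolean inScope(String url, String allowedHost) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return false;
            }
            host = host.toLowerCase();
            String allowed = allowedHost.toLowerCase();
            return host.equals(allowed) || host.endsWith("." + allowed);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(inScope("https://item.taobao.com/item.htm?id=1", "taobao.com"));
        System.out.println(inScope("https://www.163.com/", "taobao.com"));
    }
}
```

A real vertical crawler would likely combine a check like this with path or keyword rules; the host rule is only the simplest possible boundary.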

How a crawler fetches a single page

Specify a URL
Send a network request over HTTP
Receive an HTML document
Parse the HTML document

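How a crawler fetches multiple pages

1) Specify many URLs and hold them in a data structure such as a list

2) Take the URLs from the list one at a time

    i. Send a network request over HTTP
    ii. Receive an HTML document
    iii. Parse the HTML document
        1. While parsing, also extract any other URLs
        2. Add the extracted URLs to the set of URLs waiting to be crawled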
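
Links extracted from a page are often relative (for example `b.html` or `/c.html`), so they usually have to be turned into absolute URLs before they can be queued. The following is a minimal sketch of that step using only `java.net.URI`; the class name `LinkResolver` and the rule of keeping only http/https links are assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinkResolver {

    // Resolve each raw href against the URL of the page it was found on.
    // Empty hrefs, fragment-only hrefs, invalid URIs, and non-HTTP schemes are skipped.
    public static List<String> resolve(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            String h = (href == null) ? "" : href.trim();
            if (h.isEmpty() || h.startsWith("#")) {
                continue;
            }
            try {
                URI absolute = base.resolve(h);
                String scheme = absolute.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    result.add(absolute.toString());
                }
            } catch (IllegalArgumentException e) {
                // Not a valid URI: ignore it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList("b.html", "/c.html", "#top", "http://yun.itheima.com/course");
        for (String url : resolve("http://www.itcast.cn/a/index.html", hrefs)) {
            System.out.println(url);
        }
    }
}
```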

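1) Given one or more URLs, put them into the queue of URLs waiting to be fetched
2) Read one URL from the queue, e.g. http://www.itcast.cn/xxx.html
3) Resolve the domain name to obtain an IP address
4) Send the HTTP request
5) Receive the corresponding HTML document
6) Parse the HTML document
    a) Extract the content
    b) Extract new URLs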
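
The steps above can be sketched as a simple loop over a queue. This is only an illustration under stated assumptions: steps 3 to 6 (DNS lookup, HTTP request, parsing) are hidden behind a hypothetical `fetchAndParse` function that returns the URLs found on a page, and the `seen` set and `maxPages` limit are additions not mentioned in the original steps, included so the loop terminates when pages link to each other.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

public class CrawlLoop {

    // fetchAndParse is a hypothetical stand-in for "request the page, parse it,
    // and return the new URLs found in it".
    public static List<String> crawl(List<String> seeds,
                                     Function<String, List<String>> fetchAndParse,
                                     int maxPages) {
        Queue<String> frontier = new ArrayDeque<>(); // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();          // URLs already queued at some point
        List<String> visited = new ArrayList<>();    // URLs fetched, in order

        // Step 1: put the seed URLs into the waiting queue
        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(seed);
            }
        }
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            // Step 2: read one URL from the queue
            String url = frontier.poll();
            visited.add(url);
            // Steps 3-6: fetch, parse, and queue every URL not seen before
            for (String link : fetchAndParse.apply(url)) {
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

Because the queue is first-in, first-out, this visits pages breadth-first: all links found on the seed pages are fetched before links found deeper in the site.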

2. Accessing the network with HttpClient

Preparation before development

1. Create a Maven project
2. Add the dependency

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.3</version>
    </dependency>

Add the fluent API dependency

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>fluent-hc</artifactId>
        <version>4.5.3</version>
    </dependency>

Demo: a GET request with HttpClient


import java.nio.charset.Charset;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientGet {
    public static void main(String[] args) throws Exception {
        // 1. Specify a URL
        String url = "http://www.163.com";
        // 2. Create a default HttpClient
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 3. For a GET request, create an HttpGet object
        HttpGet httpGet = new HttpGet(url);
        // Optional: set a header (this must come after httpGet has been created)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36");
        // 4. Send the request
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 5. Get the response entity
        HttpEntity entity = response.getEntity();
        // 6. Read the body as text (utf-8 is the fallback when the response declares no charset) and print it
        String entityData = EntityUtils.toString(entity, Charset.forName("utf-8"));
        System.out.println(entityData);
        // 7. Release the response and the client
        response.close();
        httpClient.close();
    }
}

Demo: a POST request with HttpClient

import java.nio.charset.Charset;
import java.util.ArrayList;

import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class HttpClientPost {

    public static void main(String[] args) throws Exception {
        // 1. Specify a URL
        String url = "http://www.itcast.cn/login.html";

        // 2. Create a default HttpClient
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // 3. For a POST request, create an HttpPost object
        HttpPost httpPost = new HttpPost(url);

        // 4. Set the form parameters
        ArrayList<BasicNameValuePair> parameters = new ArrayList<BasicNameValuePair>();
        parameters.add(new BasicNameValuePair("username", "DHC"));
        parameters.add(new BasicNameValuePair("password", "123"));
        httpPost.setEntity(new UrlEncodedFormEntity(parameters));
        // 5. Send the request
        CloseableHttpResponse response = httpClient.execute(httpPost);
        // 6. Read the response body as text and print it
        String body = EntityUtils.toString(response.getEntity(), Charset.forName("utf-8"));
        System.out.println(body);
        // 7. Release the response and the client
        response.close();
        httpClient.close();
    }

}
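
The POST demo sends its parameters as an `application/x-www-form-urlencoded` body. To show roughly what such a body looks like on the wire, here is a minimal standard-library sketch that builds one by hand. The class name `FormBody` is hypothetical, and the use of UTF-8 is an assumption of this sketch rather than a statement about what `UrlEncodedFormEntity` does by default.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBody {

    // Join name=value pairs with '&', percent-encoding names and values (UTF-8 assumed).
    public static String encode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on the JVM, so this should not happen
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("username", "DHC");
        params.put("password", "123");
        System.out.println(encode(params)); // prints username=DHC&password=123
    }
}
```

Characters with a special meaning in the format are escaped, so a value such as `a b&c` becomes `a+b%26c` and cannot be mistaken for a second parameter.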

3. Parsing HTML with the jsoup library

Create a Maven project

Add the jsoup dependency

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>

jsoup selector syntax for parsing HTML

Demo: parsing HTML with jsoup


import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String[] args) throws Exception {
        // Get an HTML document; jsoup fetches the URL itself here (10-second timeout).
        // The HTML could instead come from httpClient.execute(httpGet).
        Document document = Jsoup.parse(new URL("http://www.itcast.cn"), 10 * 1000);
        // Tag selector: the <title> element
        Elements titles = document.select("title");
        for (Element element : titles) {
            System.out.println(element.ownText());
        }
        // Tag selector: every <meta> element, printing its name and content attributes
        Elements metas = document.select("meta");
        for (Element element : metas) {
            System.out.println(element.attr("name") + element.attr("content"));
        }
        // Attribute-value selector: <a> tags whose href equals the given URL
        Elements aTags = document.select("a[href=http://yun.itheima.com/course]");
        for (Element element : aTags) {
            System.out.println(element);
        }

        // Class selector: elements with class "info"
        Elements banners = document.select(".info");
        for (Element element : banners) {
            System.out.println(element);
        }

        // Attribute-name-prefix selector: <a> tags that have an attribute whose name starts with "cla"
        Elements atags = document.select("a[^cla]");
        for (Element element : atags) {
            System.out.println(element);
        }

        // Attribute-value selector: <ul> tags whose class attribute equals "nav_txt"
        atags = document.select("ul[class=nav_txt]");
        for (Element element : atags) {
            System.out.println(element);
        }
        // Attribute-value-prefix selector: any element whose class attribute starts with "nav"
        atags = document.select("[class^=nav]");
        for (Element element : atags) {
            System.out.println(element);
        }

        // Tag plus class selector: <div> tags with class "ban_up"
        atags = document.select("div.ban_up");
        for (Element element : atags) {
            System.out.println(element);
        }
    }

}