1. Web Crawlers
A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.
1.1. A First Crawler Program
1.1.1. Environment
JDK 1.8
IntelliJ IDEA
IDEA's bundled Maven
1.1.2. Preparing the Environment
Create a Maven project named itcast-crawler-first and add the following dependencies to its pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.xxx</groupId>
    <artifactId>crawler_first</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-log4j12 -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
        </dependency>
    </dependencies>
</project>
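The slf4j-log4j12 binding routes logging through log4j 1.x, which looks for a log4j.properties file on the classpath (e.g. src/main/resources). The project's actual configuration is not shown in the original; a minimal console setup, as an assumption, might look like:

# illustrative configuration, not from the original project
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss} [%t] %-5p %c - %m%n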
1.1.3. Writing the Java Code
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class CrawlerFirst {
    public static void main(String[] args) throws Exception {
        // 1. "Open the browser": create an HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. "Type in the address": create an HttpGet object for the target URL
        HttpGet httpGet = new HttpGet("http://www.itcast.cn");
        // 3. "Press Enter": send the GET request and receive the response
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 4. Parse the response and extract the data
        // Check whether the status code is 200 (OK)
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity httpEntity = response.getEntity();
            String content = EntityUtils.toString(httpEntity, "utf8");
            System.out.println(content);
        }
    }
}
Test result: the program prints the HTML text of the requested page.
2. Web Crawlers
2.1. Introduction to Web Crawlers
In the era of big data, gathering information is an important task, and the data on the Internet is massive. Collecting it purely by hand is not only slow and tedious; the cost of collection also rises. How to automatically and efficiently obtain the Internet information we care about and put it to use is therefore an important problem, and crawler technology exists to solve it.
A web crawler, also called a web robot, automates the collection and organization of data from the Internet in place of a human. It is a program or script that fetches information from the World Wide Web according to a set of rules, automatically harvesting the content of every page it can reach in order to obtain the data of interest.
Functionally, a crawler generally has three parts: data collection, processing, and storage. It starts from the URLs of one or more seed pages, obtains the URLs found on them, and, while fetching pages, keeps extracting new URLs from the current page into a queue until some stop condition of the system is met.
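This fetch-extract-enqueue loop is easy to sketch in code. The version below is a minimal illustration only: the seed URL, the 10-page stop condition, and the regex-based link extraction are assumptions made for the sketch; a real crawler would also respect robots.txt and use a proper HTML parser.

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class CrawlerSketch {
    public static void main(String[] args) throws Exception {
        Queue<String> queue = new LinkedList<>();  // URLs waiting to be fetched
        Set<String> visited = new HashSet<>();     // URLs already fetched
        queue.add("http://www.itcast.cn");         // seed URL (assumption)
        Pattern link = Pattern.compile("href=\"(http[^\"]+)\"");  // naive link extraction
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            while (!queue.isEmpty() && visited.size() < 10) {  // stop condition: 10 pages
                String url = queue.poll();
                if (!visited.add(url)) continue;  // skip URLs seen before
                try (CloseableHttpResponse response = client.execute(new HttpGet(url))) {
                    if (response.getStatusLine().getStatusCode() != 200) continue;
                    String html = EntityUtils.toString(response.getEntity(), "utf8");
                    // ... data processing and storage would happen here ...
                    Matcher m = link.matcher(html);
                    while (m.find()) {
                        queue.add(m.group(1));  // enqueue newly discovered URLs
                    }
                }
            }
        }
    }
}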
3. HttpClient
A web crawler is a program that accesses network resources on our behalf. We have always used the HTTP protocol to visit web pages, and a crawler does exactly the same thing, only from code.
Here we use Apache HttpClient, an HTTP client library for Java, to fetch page data.
3.1. GET Request
Visit the Baidu homepage. The request URL is:
http://www.baidu.com/
import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpGetTest {
    public static void main(String[] args) {
        // 1. Create an HttpClient object (like opening a browser)
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create an HttpGet object (like typing in the address)
        HttpGet httpGet = new HttpGet("http://www.baidu.com");
        // 3. Send the HTTP request
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            // 4. If the status code is 200, print the length of the response body
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release resources; response is still null if execute() threw
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
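Since CloseableHttpClient and CloseableHttpResponse both implement Closeable, the null check and the nested try/catch in the finally block can be avoided with try-with-resources (Java 7+). A sketch of the same request in that style (the class name here is ours, not from the original):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpGetTryWithResources {
    public static void main(String[] args) throws Exception {
        HttpGet httpGet = new HttpGet("http://www.baidu.com");
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        }  // both resources are closed automatically, even if an exception is thrown
    }
}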
Result: the length of the returned HTML is printed.
3.2. GET Request with Parameters
Search for course videos on the ITheima site. The request URL is:
http://yun.itheima.com/search?keys=Java
import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpGetParamTest {
    public static void main(String[] args) throws Exception {
        // 1. Create an HttpClient object (like opening a browser)
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Target address: http://yun.itheima.com/search?keys=Java
        // Use URIBuilder to assemble the query string
        URIBuilder builder = new URIBuilder("http://yun.itheima.com/search");
        // Set the parameter (call setParameter once per parameter)
        builder.setParameter("keys", "Java");
        // 2. Create an HttpGet object from the built URI (like typing in the address)
        HttpGet httpGet = new HttpGet(builder.build());
        // 3. Send the HTTP request
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            // 4. If the status code is 200, print the length of the response body
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release resources; response is still null if execute() threw
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
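As the comment in the code notes, additional query parameters are added with further setParameter calls; URIBuilder returns itself, so the calls can be chained. A small sketch (the page parameter is hypothetical, purely for illustration):

import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;

public class MultiParamSketch {
    public static void main(String[] args) throws Exception {
        URIBuilder builder = new URIBuilder("http://yun.itheima.com/search");
        builder.setParameter("keys", "Java")
               .setParameter("page", "1");  // "page" is a hypothetical parameter for illustration
        HttpGet httpGet = new HttpGet(builder.build());
        // Prints: http://yun.itheima.com/search?keys=Java&page=1
        System.out.println(httpGet.getURI());
    }
}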
Result: the length of the returned HTML is printed.
3.3. POST Request
The only difference from the GET request above is the use of HttpPost:
import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpPostTest {
    public static void main(String[] args) {
        // 1. Create an HttpClient object (like opening a browser)
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create an HttpPost object (like typing in the address)
        HttpPost httpPost = new HttpPost("http://www.itcast.cn");
        // 3. Send the HTTP request
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpPost);
            // 4. If the status code is 200, print the length of the response body
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release resources; response is still null if execute() threw
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
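Note that this POST carries no request body; it differs from the GET example only in the request method. The next section shows how to attach form parameters to the body.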
3.4. POST Request with Parameters
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class HttpPostParamTest {
    public static void main(String[] args) throws Exception {
        // 1. Create an HttpClient object (like opening a browser)
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create an HttpPost object (like typing in the address)
        HttpPost httpPost = new HttpPost("http://yun.itheima.com/search");
        // Declare a List to hold the form parameters
        List<NameValuePair> params = new ArrayList<NameValuePair>();
        // Equivalent to requesting http://yun.itheima.com/search?keys=Java
        params.add(new BasicNameValuePair("keys", "Java"));
        // Wrap the parameters in a form entity
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf8");
        // Attach the form entity to the POST request
        httpPost.setEntity(formEntity);
        // 3. Send the HTTP request
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpPost);
            // 4. If the status code is 200, print the length of the response body
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release resources; response is still null if execute() threw
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
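UrlEncodedFormEntity encodes the parameters as application/x-www-form-urlencoded and sets the matching Content-Type header, so the server reads them just as it would an ordinary HTML form submission.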