1. Download the required JARs (declared here as Maven dependencies):
<!-- HttpClient component -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.3</version>
</dependency>
<!-- Jsoup component -->
<dependency>
<!-- jsoup HTML parser library @ http://jsoup.org/ -->
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.2</version>
</dependency>
2. Create an HttpClient instance:
// Create an HttpClient instance
HttpClientBuilder httpClientBuilder = HttpClientBuilder.create();
CloseableHttpClient httpclient = httpClientBuilder.build();
3. Create the GET or POST request:
HttpGet httpGet = new HttpGet(url); // url is a String holding the target address
HttpPost httpPost = new HttpPost(url);
4. Declare a response object:
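CloseableHttpResponse response = null;
5. Use the response object to receive the result of executing the request:
response = httpclient.execute(httpGet);
This call must be wrapped in try-catch, because execute() throws IOException.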
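As a minimal sketch of the try-catch mentioned in step 5, the snippet below uses try-with-resources so that both the client and the response are closed even when the request fails. The class name `ExecuteSketch`, the method name `statusOf`, and the `-1` return value for a failed request are illustrative assumptions, not part of the original steps.

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;

// Hypothetical helper class, used only for illustration.
class ExecuteSketch {

    /** Executes a GET request and returns the HTTP status code, or -1 if the request failed. */
    static int statusOf(String url) {
        // try-with-resources closes the response first, then the client.
        try (CloseableHttpClient httpclient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpclient.execute(new HttpGet(url))) {
            return response.getStatusLine().getStatusCode();
        } catch (IOException e) {
            // Connection failures and protocol errors (ClientProtocolException) both end up here.
            System.err.println("Request failed: " + e.getMessage());
            return -1; // assumed sentinel value for "no response"
        }
    }
}
```

Calling `ExecuteSketch.statusOf("http://example.com/")` would return the status code of that page if it is reachable; the URL is only a placeholder.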
6. Consume the entity content:
HttpEntity entity = response.getEntity();
String html = EntityUtils.toString(entity, "utf-8");
7. Print html to get the page content we want.
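Steps 2 through 7 can be put together roughly as follows. This is a sketch under a few assumptions: the class name `PageFetcher` and method name `fetch` are made up for illustration, an empty string is returned when the response has no body, and UTF-8 is used only as the fallback charset when the server does not declare one.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;

// Hypothetical helper class, used only for illustration.
class PageFetcher {

    /** Fetches the page at the given URL and returns its body as a String. */
    static String fetch(String url) throws IOException {
        // Step 2: create the client. Steps 3-5: build and execute the GET request.
        try (CloseableHttpClient httpclient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpclient.execute(new HttpGet(url))) {
            // Step 6: consume the entity.
            HttpEntity entity = response.getEntity();
            if (entity == null) {
                return ""; // assumption: treat a response without a body as empty content
            }
            // UTF-8 applies only if the Content-Type header carries no charset.
            return EntityUtils.toString(entity, StandardCharsets.UTF_8);
        }
    }
}
```

Step 7 then amounts to `System.out.println(PageFetcher.fetch(url));` for whatever URL you want to read.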
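8. Use a proxy:
HttpHost proxy = new HttpHost(host, port); // host is a String, port is an int
RequestConfig.Builder globalConfig = RequestConfig.custom();
globalConfig.setConnectTimeout(10000); // maximum time to establish the connection, in milliseconds
globalConfig.setSocketTimeout(30000); // maximum time to wait for data (read timeout), in milliseconds
httpGet.setConfig(globalConfig.setProxy(proxy).build());
9. Use a proxy IP that requires a username and password: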
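CredentialsProvider credsProvider = new BasicCredentialsProvider();
credsProvider.setCredentials(AuthScope.ANY,
        new UsernamePasswordCredentials("username", "password"));
// The provider only takes effect once it is registered on the client:
// httpClientBuilder.setDefaultCredentialsProvider(credsProvider);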
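Steps 8 and 9 can be combined into one client roughly as shown below. This is only a sketch: the class name `ProxyClientFactory` and its method names are hypothetical, and it differs from the snippet above in two ways that are my own choices. The proxy and timeouts are set as the client's default request config instead of per request, and the credentials are scoped to the proxy's host and port instead of `AuthScope.ANY`, so they are not offered to other hosts.

```java
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;

// Hypothetical helper class, used only for illustration.
class ProxyClientFactory {

    /** Request config that routes through the proxy, with the timeouts from step 8. */
    static RequestConfig proxyConfig(String host, int port) {
        return RequestConfig.custom()
                .setProxy(new HttpHost(host, port))
                .setConnectTimeout(10000) // connect timeout in milliseconds
                .setSocketTimeout(30000)  // read timeout in milliseconds
                .build();
    }

    /** Credentials limited to the proxy host and port (assumption: narrower than AuthScope.ANY). */
    static CredentialsProvider proxyCredentials(String host, int port, String user, String password) {
        CredentialsProvider credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(new AuthScope(host, port),
                new UsernamePasswordCredentials(user, password));
        return credsProvider;
    }

    /** Builds a client that uses the proxy and its credentials for every request. */
    static CloseableHttpClient build(String host, int port, String user, String password) {
        return HttpClientBuilder.create()
                .setDefaultRequestConfig(proxyConfig(host, port))
                .setDefaultCredentialsProvider(proxyCredentials(host, port, user, password))
                .build();
    }
}
```

A client built with `ProxyClientFactory.build(host, port, "username", "password")` can then execute `HttpGet` and `HttpPost` requests exactly as in step 5.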
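This article shows how to fetch web page content with HttpClient and Jsoup. Starting from adding the dependencies, it walks through creating an HttpClient instance, sending GET/POST requests, and handling the response, and it also shows how to configure a proxy and timeouts.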
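The steps above stop at the raw HTML string and do not show the Jsoup side. As one possible sketch of what could come next, the snippet below parses that string with Jsoup and pulls out the page title and the links. The class name `LinkExtractor`, its method names, and the choice of extracting `a[href]` elements are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for illustration.
class LinkExtractor {

    /** Returns the text of the page's title element. */
    static String title(String html) {
        return Jsoup.parse(html).title();
    }

    /** Returns every link on the page, with relative URLs resolved against baseUrl. */
    static List<String> extractLinks(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            links.add(a.absUrl("href")); // absolute form of the href attribute
        }
        return links;
    }
}
```

Here the `html` argument would be the string obtained in step 6, and `baseUrl` the address that was requested.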



