1. Add the pom dependencies (HttpClient, Jsoup)
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
2. Create a simple request
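import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// @Test is the JUnit annotation; import it from the JUnit version your project uses.
@Test
public void testJsoup() throws Exception {
    // Create the GET request and set a browser-like User-Agent header
    HttpGet httpGet = new HttpGet("https://blog.youkuaiyun.com/weixin_73375551/article/details/129018283?spm=1001.2014.3001.5502");
    httpGet.setHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36");

    // try-with-resources closes the client and the response even if an exception is thrown
    try (CloseableHttpClient httpClient = HttpClients.createDefault();
         CloseableHttpResponse response = httpClient.execute(httpGet)) {
        // Read the page content only when the request succeeded
        if (response.getStatusLine().getStatusCode() == 200) {
            String html = EntityUtils.toString(response.getEntity(), "UTF-8");
            // Create the Document object
            Document document = Jsoup.parse(html);
            Element element = document.getElementById("articleContentId");
            // getElementById returns null when no element has this id
            if (element != null) {
                System.out.println(element.text());
            }
        }
    }
}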
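HttpClient is used to create the connection object. For a GET request, create an HttpGet object; for a POST request, create an HttpPost object. The constructor argument is the URL to visit.
Add headers as the request requires. Calling HttpClient's execute() method sends the request and returns a CloseableHttpResponse object; check the response status code to determine whether the request succeeded.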
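The POST case mentioned above can be sketched in the same way. This is a minimal example under assumptions, not a definitive implementation: the URL https://example.com/search, the form field name keyword, the shortened User-Agent value, and the class and method names are all hypothetical placeholders. It assumes the HttpClient 4.x API used elsewhere in this post.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

class PostSketch {

    // Build a form-encoded POST request; nothing is sent yet.
    // The field name "keyword" is a hypothetical example.
    static HttpPost buildPost(String url, String keyword) {
        HttpPost httpPost = new HttpPost(url);
        httpPost.setHeader("User-Agent", "Mozilla/5.0");
        httpPost.setEntity(new UrlEncodedFormEntity(
                Arrays.asList(new BasicNameValuePair("keyword", keyword)),
                StandardCharsets.UTF_8));
        return httpPost;
    }

    // Send the request and return the body, or null for a non-200 status.
    static String send(HttpPost httpPost) throws Exception {
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpPost)) {
            if (response.getStatusLine().getStatusCode() != 200) {
                return null;
            }
            return EntityUtils.toString(response.getEntity(), "UTF-8");
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; only the request is built here, no network call is made.
        HttpPost httpPost = buildPost("https://example.com/search", "jsoup");
        System.out.println(httpPost.getRequestLine());
    }
}
```

Building the request is kept separate from sending it so that the headers and form body can be inspected without touching the network. To send it, call send(httpPost) against a real endpoint.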
3. Parse the page with Jsoup
Calling Jsoup's parse(String html) method parses the raw HTML page into a Document object. From there we can retrieve elements from the page with getElementById(String id), getElementsByClass(String className), select(String cssQuery), and similar methods.
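The three lookup methods can be compared side by side on a small in-memory page. This is only a sketch: the HTML string, the class name tag, and the class name JsoupSelectSketch are made up for illustration and stand in for a downloaded page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class JsoupSelectSketch {

    // Hypothetical markup standing in for a downloaded page.
    static final String HTML =
            "<html><body>"
            + "<div id=\"articleContentId\">Jsoup basics</div>"
            + "<p class=\"tag\">java</p>"
            + "<p class=\"tag\">crawler</p>"
            + "<a href=\"/next\">Next</a>"
            + "</body></html>";

    public static void main(String[] args) {
        Document document = Jsoup.parse(HTML);

        // By id: a single Element, or null when nothing matches.
        Element content = document.getElementById("articleContentId");
        if (content != null) {
            System.out.println(content.text()); // prints "Jsoup basics"
        }

        // By class name: every element carrying the class.
        Elements tags = document.getElementsByClass("tag");
        for (Element tag : tags) {
            System.out.println(tag.text()); // prints "java", then "crawler"
        }

        // By CSS selector: here, all links that have an href attribute.
        Elements links = document.select("a[href]");
        for (Element link : links) {
            System.out.println(link.attr("href")); // prints "/next"
        }
    }
}
```

getElementById returns at most one element and may return null, while getElementsByClass and select return an Elements collection that is empty when nothing matches, so only the first needs a null check.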