Web crawler: fetching web pages with HttpClient + Jericho HTML Parser

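This article shows how to fetch web pages with the Java HttpClient library and parse the resulting HTML with Jericho HTML Parser. The sample code covers setting up HttpClient, sending a GET request, handling the response, and extracting specific elements.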

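Jericho HTML Parser is a simple but powerful Java library for parsing HTML. It can analyse and manipulate parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides a useful HTML form analyser.
Download: http://sourceforge.net/project/showfiles.php?group_id=101067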

HttpClient is the HTTP client component that communicates with the server; jdom is used alongside it to parse XML data.

HttpClient can be downloaded from http://jakarta.apache.org/commons/httpclient/downloads.html
HttpClient depends on logging, a subproject of Apache Jakarta Commons. Download Commons Logging from http://jakarta.apache.org/site/downloads/downloads_commons-logging.cgi , extract commons-logging.jar from the archive, and add it to the CLASSPATH.
HttpClient also depends on codec, another Apache Jakarta Commons subproject. Download the latest Commons Codec from http://jakarta.apache.org/site/downloads/downloads_commons-codec.cgi , extract commons-codec-1.x.jar from the archive, and add it to the CLASSPATH.

Fetching page content mainly relies on the GET method.

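Using HttpClient takes the following six steps:

1. Create an HttpClient instance.

2. Create an instance of a connection method, here GetMethod, passing the target URL to its constructor.

3. Call executeMethod on the HttpClient instance from step 1 to execute the method instance from step 2.

4. Read the response.

5. Release the connection. This must happen whether or not the method executed successfully.

6. Process the content that was read.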
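
The six steps above can also be tried without any third-party jars. The following is only a minimal sketch of the same shape, not the HttpClient API: it swaps in the JDK's HttpURLConnection for Commons HttpClient, and serves a tiny page from a throwaway local com.sun.net.httpserver.HttpServer so that it runs offline. The class name SixStepFetch, its method names, and the sample page are assumptions made for illustration, and readAllBytes requires Java 9 or later.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class SixStepFetch {

    /** Fetches a URL and returns the body as text, following the six steps. */
    static String fetch(String url) throws IOException {
        // Steps 1 and 2: with HttpURLConnection, client and GET method are one object
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        try {
            // Step 3: execute the request (getResponseCode sends it)
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Method failed: HTTP " + status);
            }
            // Step 4: read the response
            try (InputStream in = conn.getInputStream()) {
                byte[] body = in.readAllBytes();
                // Step 6: process the content (here it is simply decoded as UTF-8)
                return new String(body, StandardCharsets.UTF_8);
            }
        } finally {
            // Step 5: release the connection whether or not the request succeeded
            conn.disconnect();
        }
    }

    /** Starts a throwaway local server, fetches one page from it, and returns the body. */
    static String demo() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                byte[] page = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, page.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(page);
                }
            });
            server.start();
            try {
                int port = server.getAddress().getPort();
                return fetch("http://127.0.0.1:" + port + "/");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The try/finally around the request mirrors step 5: the connection is released on both the success path and the failure path.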

Create a project named snatch in Eclipse.
Add the four jar files downloaded above to the project's build path.
The environment is now ready.

First, a look at how HttpClient is used.
Create a test package in the project and add an HttpClientTest class to it.

package test;

import java.io.IOException;

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class HttpClientTest {

    public static void main(String[] args) {
        // Create the HttpClient instance
        HttpClient httpClient = new HttpClient();
        // Create the GET method instance
        GetMethod getMethod = new GetMethod("http://www.google.com.cn");
        // Use the default retry handler provided by the library
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());
        try {
            // Execute getMethod
            int statusCode = httpClient.executeMethod(getMethod);
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
            }
            // Read the content
            byte[] responseBody = getMethod.getResponseBody();
            // Process the content
            System.out.println(new String(responseBody));
        } catch (HttpException e) {
            // Fatal error: the protocol may be wrong or the returned content may be invalid
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // Network error
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
    }
}

This prints the page's source code.
In the code above, byte[] responseBody = getMethod.getResponseBody(); reads the content.
The response can also be read in these ways:
InputStream inputStream = getMethod.getResponseBodyAsStream();
String responseBody = getMethod.getResponseBodyAsString();
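
Whichever read method is used, the bytes eventually have to be decoded with the right charset. In HttpClient 3.x, getResponseBodyAsString decodes with the charset named in the Content-Type header and falls back to ISO-8859-1 when none is given, which garbles a GB2312 page. The sketch below uses only the JDK to show the three cases: decoding raw bytes, draining a stream, and repairing text that was wrongly decoded as ISO-8859-1. The class ResponseDecoding and its method names are hypothetical, and the sample bytes stand in for a real response body.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ResponseDecoding {

    /** Decodes a raw body (what a byte[] read returns) with an explicit charset. */
    static String fromBytes(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    /** Drains a stream (what a stream read returns) and decodes the collected bytes. */
    static String fromStream(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Repairs text that was decoded as ISO-8859-1 but is really in another charset. */
    static String redecode(String misdecoded, Charset actual) {
        // ISO-8859-1 maps each byte to exactly one char, so getBytes restores the original bytes
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312");
        // Sample body: four Chinese characters, written as escapes so the source stays ASCII
        byte[] body = "\u4fe1\u606f\u5feb\u9012".getBytes(gb2312);
        System.out.println(fromBytes(body, gb2312));
        System.out.println(fromStream(new ByteArrayInputStream(body), gb2312));
        String wrong = new String(body, StandardCharsets.ISO_8859_1);
        System.out.println(redecode(wrong, gb2312));
    }
}
```

Note that new String(bytes) without a charset uses the platform default, so the output of the earlier HttpClientTest depends on the machine it runs on.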

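The following example combines the two libraries.
It extracts the first few items of the "信息快递" (News Express) column from http://www.ahcourt.gov.cn/gb/ahgy_2004/fyxw/index.html
Create a new class, CourtNews.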

package test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

import au.id.jericho.lib.html.Element;
import au.id.jericho.lib.html.HTMLElementName;
import au.id.jericho.lib.html.Segment;
import au.id.jericho.lib.html.Source;

/**
 * @author oscar 07-5-17
 */
public class CourtNews {

    // Maximum number of news items to collect
    private int newsCount = 3;
    // Each entry is a String[] { title, url }
    private List newsList = new ArrayList();

    public int getNewsCount() {
        return newsCount;
    }

    public void setNewsCount(int newsCount) {
        this.newsCount = newsCount;
    }

    public List getNewsList() {
        HttpClient httpClient = new HttpClient();
        GetMethod getMethod = new GetMethod(
                "http://www.ahcourt.gov.cn/gb/ahgy_2004/fyxw/index.html");
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
            }
            String responseBody = getMethod.getResponseBodyAsString();
            // Re-decode as GB2312; this assumes HttpClient decoded the body as ISO-8859-1
            responseBody = new String(responseBody.getBytes("ISO-8859-1"), "GB2312");
            Source source = new Source(responseBody);
            int tableCount = 0;
            for (Iterator i = source.findAllElements(HTMLElementName.TABLE).iterator();
                    i.hasNext(); tableCount++) {
                Segment segment = (Segment) i.next();
                // Only the table at index 13 (the 14th on the page) is scanned for news links
                if (tableCount == 13) {
                    int hrefCount = 0;
                    for (Iterator j = segment.findAllElements(HTMLElementName.A).iterator();
                            j.hasNext();) {
                        Segment childsegment = (Segment) j.next();
                        String title = trimTitle(childsegment.extractText());
                        Element childelement = (Element) childsegment;
                        if (hrefCount < newsCount) {
                            String[] news = new String[] {
                                    title,
                                    "http://www.ahcourt.gov.cn"
                                            + childelement.getAttributeValue("href") };
                            newsList.add(news);
                            hrefCount++;
                        }
                    }
                }
            }
        } catch (HttpException e) {
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            getMethod.releaseConnection();
        }
        return newsList;
    }

    // Removes every space character (as defined by Character.isSpaceChar) from the title
    private String trimTitle(String title) {
        StringBuilder titlenew = new StringBuilder();
        for (int i = 0; i < title.length(); i++) {
            if (!Character.isSpaceChar(title.charAt(i))) {
                titlenew.append(title.charAt(i));
            }
        }
        return titlenew.toString();
    }

    public static void main(String[] args) {
        CourtNews justice = new CourtNews();
        justice.setNewsCount(4);
        List list = justice.getNewsList();
        Iterator it = list.iterator();
        while (it.hasNext()) {
            String[] news = (String[]) it.next();
            System.out.println(news[0]); // title
            System.out.println(news[1]); // absolute URL
        }
    }
}
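
CourtNews builds each link by prefixing the host name onto the href attribute, which works as long as every href on the page is root-relative (starts with "/"). A more general approach, sketched here with the JDK's java.net.URI, is to resolve the href against the URL of the page it came from. The class LinkResolver is hypothetical, and the href values in main are made-up samples rather than links taken from the real page.

```java
import java.net.URI;

class LinkResolver {

    /** Resolves an href found on a page against that page's own URL. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.ahcourt.gov.cn/gb/ahgy_2004/fyxw/index.html";
        // Root-relative href: same result as prefixing the host name
        System.out.println(resolve(page, "/gb/ahgy_2004/fyxw/item1.html"));
        // Document-relative href: resolved against the page's directory
        System.out.println(resolve(page, "item2.html"));
        // Absolute href: returned unchanged
        System.out.println(resolve(page, "http://example.com/x.html"));
    }
}
```

URI.create throws IllegalArgumentException for strings that are not valid URIs (an href containing a raw space, for example), so real pages may need the value escaped or the exception handled before resolving.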