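I've recently been teaching myself web crawling, using the book 《自己动手写网络爬虫》 (Write Your Own Web Crawler) as my reference and Eclipse as my IDE. The introductory example in the book seemed to have some problems, so I turned to the HttpClient 3.1 documentation instead and fetched the HTML of the Baidu home page. It's a beginner-level exercise, but I'd like to share the code and a few notes.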
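HttpClient 3.1 download: http://archive.apache.org/dist/httpcomponents/commons-httpclient/3.0/
Note that Commons HttpClient 3.x also needs commons-logging and commons-codec on the build path.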
import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.HttpMethodParams;

import java.io.*;

public class HttpClientTutorial {

    private static String url = "https://www.baidu.com/";

    public static void main(String[] args) {
        // Create an instance of HttpClient.
        HttpClient client = new HttpClient();

        // Create a method instance.
        GetMethod method = new GetMethod(url);

        // Provide a custom retry handler if necessary.
        method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler(3, false));

        try {
            // Execute the method.
            int statusCode = client.executeMethod(method);

            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + method.getStatusLine());
            }

            // Read the response body.
            byte[] responseBody = method.getResponseBody();

            // Deal with the response.
            // Use caution: make sure the data is text, not binary. new String(byte[])
            // decodes with the platform default charset, so non-ASCII text can come out
            // garbled if that charset differs from the page's encoding.
            System.out.println(new String(responseBody));

        } catch (HttpException e) {
            System.err.println("Fatal protocol violation: " + e.getMessage());
            e.printStackTrace();
        } catch (IOException e) {
            System.err.println("Fatal transport error: " + e.getMessage());
            e.printStackTrace();
        } finally {
            // Release the connection.
            method.releaseConnection();
        }
    }
}
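A note on the retry handler: DefaultHttpMethodRetryHandler(3, false) asks HttpClient to retry a request that failed with a transport error up to three times, and not to retry a request that had already been sent in full. To get a feel for the idea, here is a minimal sketch of bounded retry in plain Java. It is not HttpClient's implementation; RetrySketch and withRetries are hypothetical names made up for this illustration, and the failing task is simulated rather than a real network call.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {

    // Runs the task and retries after an IOException, at most maxRetries times.
    // With maxRetries = 3 the task is attempted at most 4 times in total.
    public static <T> T withRetries(Callable<T> task, int maxRetries) throws Exception {
        int failures = 0;
        while (true) {
            try {
                return task.call();
            } catch (IOException e) {
                failures++;
                if (failures > maxRetries) {
                    throw e; // retries exhausted: hand the error to the caller
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated task (an assumption for this sketch): fails twice, then succeeds.
        String result = withRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IOException("simulated transport error");
            }
            return "ok";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Only IOException triggers a retry here; any other exception goes straight to the caller, which roughly mirrors how the listing treats protocol errors and transport errors separately.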
Result: running HttpClientTutorial prints the HTML source of the Baidu home page to the console.
If the Chinese characters come out garbled, change the text file encoding (for example, to UTF-8) under Window > Preferences > General > Workspace.
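Garbled text can also appear before the console is involved, when the response bytes are decoded with the wrong charset. The sketch below uses only the JDK to show the effect, and decodes with an explicit charset instead of the platform default. CharsetDecodeSketch and decode are hypothetical names, and the UTF-8 fallback is my own assumption. In the listing above, the charset name could probably be taken from method.getResponseCharSet(), which HttpClient 3.1 derives from the Content-Type header.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDecodeSketch {

    // Decodes raw response bytes with the named charset. Falls back to UTF-8
    // (an assumption of this sketch) when the name is missing, illegal, or unsupported.
    public static String decode(byte[] body, String charsetName) {
        Charset charset = StandardCharsets.UTF_8;
        try {
            if (charsetName != null && Charset.isSupported(charsetName)) {
                charset = Charset.forName(charsetName);
            }
        } catch (IllegalArgumentException e) {
            // Illegal charset name: keep the UTF-8 fallback.
        }
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // "\u767e\u5ea6" is the two-character word "Baidu" in Chinese, written
        // with Unicode escapes so the source file's own encoding does not matter.
        String original = "\u767e\u5ea6";
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(body, "UTF-8");
        String wrong = decode(body, "ISO-8859-1");

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        System.out.println(wrong.length());         // 6: one char per UTF-8 byte
    }
}
```

The same six bytes give two characters when read as UTF-8 and six meaningless ones when read as ISO-8859-1, which is the kind of garbling described above.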
References:
http://hc.apache.org/httpclient-legacy/tutorial.html
https://zhidao.baidu.com/question/625740030659487804.html