A First Attempt at Crawling a .NET Website

Original article · Latest recommended article published 2024-06-05 09:30:00 · 529 views

License: CC 4.0 BY-SA

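A while ago my manager assigned me the task of crawling data from a few sections of a certain website. Looking at the site, I noticed two things. First, it had no CAPTCHA, which made crawling it easier. Second, it was built with .NET, a platform I knew little about, so, on the principle that you have to know your opponent to win, I first read up on it online.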

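Along the way I learned about VIEWSTATE, a key attribute that represents the current state of a page. When I simulate browser requests with httpClient, I have to send this state back as well, otherwise the request fails.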
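To make the VIEWSTATE step concrete, here is a minimal sketch of pulling the value out of a fetched page so it can be posted back. ASP.NET Web Forms renders it as a hidden input named `__VIEWSTATE`. The class and method names (`ViewStateExtractor`, `hiddenField`) and the sample HTML are made up for illustration, and a plain regex is used only to keep the example dependency-free; it assumes the `name` attribute appears before `value`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original crawler.
final class ViewStateExtractor {

    /**
     * Returns the value attribute of the hidden input with the given name,
     * or null if the page has no such field.
     * Assumption: name="..." comes before value="..." inside the tag.
     */
    static String hiddenField(String html, String fieldName) {
        Pattern p = Pattern.compile(
                "<input[^>]*\\bname=\"" + Pattern.quote(fieldName)
                        + "\"[^>]*\\bvalue=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        String html = "<form><input type=\"hidden\" name=\"__VIEWSTATE\" "
                + "id=\"__VIEWSTATE\" value=\"/wEPDwUKLTEyMw==\" /></form>";
        System.out.println(hiddenField(html, "__VIEWSTATE"));
    }
}
```

The extracted string would then be sent as the `__VIEWSTATE` form parameter of the next POST. The same helper could read other hidden fields if the target page has them.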

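With that in place I started crawling the site. The first problem to solve was login. There was no CAPTCHA, so it looked simple at first, but my simulated login kept failing even though I followed what the login page asked for. The awkward part was that when I logged in with a valid username and password, I could not capture the parameter list of the real verification request: as soon as the server handled that request successfully, it redirected to another URL, and the parameters of that URL were not the ones I needed. I could not solve this at the time, so I cheated: I saved the VIEWSTATE from an already logged-in session and used it directly in the later steps. Later, while optimizing, I found that entering a wrong username or password did let me capture the parameter list. Analyzing it showed that the login request also requires parameters for where the mouse clicked. By observation I worked out the acceptable range for the click position, generated random numbers within that range, filled in the rest of the required parameters, and from then on the little crawler ran stably.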
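The random click position described above could look roughly like the sketch below. Everything specific in it is an assumption: the field names (`txtUser`, `txtPassword`, `btnLogin.x`, `btnLogin.y`) and the coordinate bounds are placeholders, since the post does not give the real parameter names or the observed range. The real ones would come from the captured parameter list.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of the original crawler.
final class LoginFormBuilder {

    // Placeholder bounds; the post only says a reasonable range was found by observation.
    static final int MIN_X = 1, MAX_X = 60;
    static final int MIN_Y = 1, MAX_Y = 20;

    /** Returns a random integer in [min, max], both ends inclusive. */
    static int randomInRange(int min, int max) {
        return ThreadLocalRandom.current().nextInt(min, max + 1);
    }

    /** Builds the login parameter list, including a random click position. */
    static Map<String, String> loginParams(String viewState, String user, String password) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("__VIEWSTATE", viewState);
        params.put("txtUser", user);          // placeholder field name
        params.put("txtPassword", password);  // placeholder field name
        // Placeholder names for the click coordinates the server checks.
        params.put("btnLogin.x", String.valueOf(randomInRange(MIN_X, MAX_X)));
        params.put("btnLogin.y", String.valueOf(randomInRange(MIN_Y, MAX_Y)));
        return params;
    }

    public static void main(String[] args) {
        System.out.println(loginParams("viewstate-value", "user", "secret"));
    }
}
```

Each map entry could then be turned into a `NameValuePair` and passed to a POST helper. Drawing a fresh position per login keeps the request inside the accepted range without sending the same coordinates every time.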

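One pitfall along the way: when I inspected request parameters in Google Chrome, the browser often froze because the viewState is so large. Switching to Firefox made the problem go away. This held me up for a while -.-! Choosing the right tool really does matter~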

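One last note: I process the crawled HTML with jsoup. Its jQuery-like selectors make it convenient to get text, values and so on, and its set of methods is fairly complete.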

httpClient:

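import java.io.IOException;

import org.apache.commons.httpclient.Cookie;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.PostMethod;

// The methods below belong inside a utility class (Apache Commons HttpClient 3.x).

/**
 * Sends a form POST and returns the response body.
 *
 * @param url    the address to request
 * @param data   form parameters, may be null
 * @param cookie value for the Cookie request header, may be null or empty
 * @return the response body, or null if the request failed
 */
public static String postRequest(String url, NameValuePair[] data,
        String cookie) {
    String responseHtml = null;
    HttpClient httpClient = new HttpClient();
    PostMethod postMethod = new PostMethod(url);
    try {
        // "Accept-Encoding: gzip, deflate" is not sent: Commons HttpClient 3.x
        // does not decompress responses, so a compressed body would be
        // unreadable when fetched with getResponseBodyAsString().
        postMethod.setRequestHeader("Content-Type",
                "application/x-www-form-urlencoded");
        if (cookie != null && !cookie.isEmpty()) {
            postMethod.setRequestHeader("Cookie", cookie);
        }
        postMethod.setRequestHeader(
                "User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36");
        if (data != null) {
            postMethod.setRequestBody(data);
        }
        httpClient.executeMethod(postMethod);
        // Debug output: the cookies the client holds after this request.
        Cookie[] cookies = httpClient.getState().getCookies();
        for (Cookie c : cookies) {
            System.err.println("cookies = " + c.toString());
        }
        responseHtml = postMethod.getResponseBodyAsString();
    } catch (HttpException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        postMethod.releaseConnection();
    }

    return responseHtml;
}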

/**
 * Gets the cookies a site sets in response to a POST.
 *
 * @param url  the address to request
 * @param data request parameters, may be null
 * @return the cookies as one string, usable as a Cookie request header
 */
public static String getWebsiteCookie(String url, NameValuePair[] data) {
    StringBuilder cookie = new StringBuilder();
    HttpClient httpClient = new HttpClient();
    PostMethod postMethod = new PostMethod(url);
    try {
        postMethod.setRequestHeader("Content-Type",
                "application/x-www-form-urlencoded");
        postMethod.setRequestHeader(
                "User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36");
        if (data != null) {
            postMethod.setRequestBody(data);
        }
        httpClient.executeMethod(postMethod);
        // Walk the HTTP response headers and collect every Set-Cookie value.
        Header[] headers = postMethod.getResponseHeaders();
        for (Header h : headers) {
            if ("Set-Cookie".equalsIgnoreCase(h.getName())) {
                String s = h.getValue();
                // Keep only the name=value pair; a value without ';' is kept whole
                // instead of failing in substring().
                int end = s.indexOf(';');
                String sub = end >= 0 ? s.substring(0, end) : s;
                // Separate pairs with "; " so there is no trailing separator.
                if (cookie.length() > 0) {
                    cookie.append("; ");
                }
                cookie.append(sub);
            }
        }
    } catch (HttpException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        postMethod.releaseConnection();
    }

    return cookie.toString();
}
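The cookie-joining step in getWebsiteCookie can also be isolated as a pure function, which makes it easy to check without a network call. The sketch below is only an illustration under two assumptions of its own: a Set-Cookie value with no `;` is kept whole, and pairs are joined with `"; "`. The class name `CookieHeaderJoiner` and the sample cookie values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original crawler.
final class CookieHeaderJoiner {

    /**
     * Turns raw Set-Cookie header values into one Cookie header string by
     * keeping only the leading name=value pair of each value.
     */
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            int end = value.indexOf(';');
            String pair = (end >= 0 ? value.substring(0, end) : value).trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Sample header values invented for this sketch.
        List<String> raw = Arrays.asList(
                "ASP.NET_SessionId=abc; path=/; HttpOnly",
                "token=xyz");
        System.out.println(toCookieHeader(raw));
    }
}
```

The resulting string is what would be passed as the cookie argument of postRequest.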