Scraping Simple Web Page Data with a Java Crawler: A Simple HtmlUnit + Jsoup Crawler

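This example extracts data from threads on the Suqian Forum (boards such as "关注宿迁", i.e. "Focus on Suqian"). For the original poster's post in each thread it collects: title, post time, website name, author, content (the post body), and access link.

The idea is to request the link from Java, obtain the page's HTML as a string, and then parse out the links and other data that are needed.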
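As a dependency-free illustration of this fetch-then-parse flow, here is a minimal sketch that uses only the JDK (Java 11+). It is not what this article uses: `java.net.http.HttpClient` stands in for HtmlUnit, and a simple regular expression stands in for jsoup purely to show the shape of the pipeline. The class and method names are hypothetical, and a regular expression is not a robust way to parse real HTML.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: fetch a page as a string, then pull one value out of it.
final class FetchThenParseSketch {

    // Matches the text between <title> and </title>, ignoring case and line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Step 1: request the URL and return the response body as a string.
    static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        // ofString() decodes with the charset from the Content-Type header (UTF-8 if absent).
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Step 2: extract the needed data from the HTML string; returns null if no title is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) throws Exception {
        // With a URL argument the page is fetched; otherwise an inline sample is parsed.
        String html = args.length > 0
                ? fetchHtml(args[0])
                : "<html><head><title>Demo - Site</title></head></html>";
        System.out.println(extractTitle(html));
    }
}
```

Run without arguments, this parses the inline sample and prints `Demo - Site`. The remainder of the article performs the same two steps with HtmlUnit and jsoup, which handle real-world markup far better.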

Technically, it uses Jsoup + HtmlUnit:

HtmlUnit fetches the web page (official site: http://htmlunit.sourceforge.net/).

jsoup parses the page and extracts the data and links (documentation: https://jsoup.org/apidocs/).

Usage:

Step 1:

Create a WebClient object, fetch the page, and read its title and markup:

WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setCssEnabled(false);
HtmlPage page = webClient.getPage("https://www.baidu.com"); // fetch the page
String pageTitle = page.getTitleText(); // the page's <title> text
String pageContent = page.asXml(); // the page serialized as XML
webClient.close(); // close the WebClient

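Step 2:

Use jsoup to parse the page content; its jQuery-like selector syntax makes DOM operations convenient:

Document doc = Jsoup.parse(pageContent);
String title = doc.title(); // page title
// img elements under the element with id "lg"; unlike getElementById("lg"),
// select() returns an empty Elements instead of null when nothing matches
Elements lg = doc.select("#lg img");
String picUrl = lg.attr("src"); // src of the first matching image, or "" if there is none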

Sample URLs parsed:

1. http://www.sqee.cn/thread-884131-1-5.html

2. http://www.sqee.cn/thread-883835-1-1.html

Parsed results (the field values are the forum's original Chinese text, reproduced verbatim):

1. {
    "announceTime":"2017/6/20 11:38:40",
    "announceUser":"4704545000",
    "content":"我想请问大家有谁知道宿迁实验小学黄河分校2017年啥时候开始招生啊？或者有招生办的电话望告知一声。谢谢大家！！！！",
    "title":"关于小学上学问题 ",
    "webSiteName":" 关注宿迁宿迁论坛|鼎鼎有民|大宿网"
}

2. {
    "announceTime":"2017/6/18 08:53:20",
    "announceUser":"柠檬草",
    "content":"三台山现在几点关门？听说现在营业时间延长了，具体是几点呢？有人知道吗？",
    "title":"三台山现在几点关门？ ",
    "webSiteName":" 娱乐旅游宿迁论坛|鼎鼎有民|大宿网"
}

Sample code:

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.UnknownHostException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import com.alibaba.fastjson.JSON;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class PartPoliticalInfoService {

    public static void main(String[] args) {
        // Quick test
        System.out.println(JSON.toJSONString(
                getSqcInfo("http://www.sqee.cn/thread-883835-1-1.html")));
    }

    // Parses the original poster's information from a thread on the Suqian Forum
    // (boards such as "关注宿迁").
    public static PartPoliticalInfo getSqcInfo(String url) {
        String title = null, webSiteName = null, announceTime = null,
                announceUser = null, content = null;

        // Step 1: fetch the page with HtmlUnit. JavaScript is disabled below, so
        // this is the markup as served, without any page script having run.
        // (FIREFOX_31 depends on the HtmlUnit version in use; the earlier snippet
        // uses BrowserVersion.CHROME.)
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_31);
        // HtmlUnit's CSS and JavaScript support is limited, so turn both off.
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);
        try {
            HtmlPage page = webClient.getPage(url);
            String pageContent = page.asXml();

            // Step 2: parse the page with jsoup and pick out the needed elements.
            Document doc = Jsoup.parse(pageContent);
            Elements authi = doc.getElementsByClass("authi");
            Elements tFsz = doc.getElementsByClass("t_fsz");

            // The page title is expected to look like "<thread title>-<site name>".
            String[] arrTitle = doc.title().split("-");
            if (arrTitle.length == 2) {
                title = arrTitle[0];
                webSiteName = arrTitle[1];
            }

            if (authi.size() >= 2) {
                announceUser = authi.get(0).getElementsByTag("a").text();

                // The post time is either in a <span title="..."> or in the <em> text.
                // Reading the attribute from the Elements collection avoids an
                // IndexOutOfBoundsException when there is no <span> at all.
                String spanTitle = authi.get(1).getElementsByTag("span").attr("title");
                announceTime = !spanTitle.isEmpty()
                        ? spanTitle
                        : authi.get(1).getElementsByTag("em").text();

                // Keep the last two space-separated tokens (date and time) and
                // rewrite the date from yyyy-M-d to yyyy/M/d.
                String[] timeTotal = announceTime.split(" ");
                if (timeTotal.length >= 2) {
                    String dateStr = timeTotal[timeTotal.length - 2];
                    String timeStr = timeTotal[timeTotal.length - 1];
                    announceTime = dateStr.replace("-", "/") + " " + timeStr;
                }
            }

            if (tFsz.size() >= 1) {
                // The first "t_fsz" block holds the original poster's post body.
                content = tFsz.get(0).getElementsByTag("td").text();
            }
        } catch (UnknownHostException e) {
            title = "UnknownHostException";
        } catch (FailingHttpStatusCodeException e) {
            e.printStackTrace();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the client even when fetching or parsing fails.
            webClient.close();
        }

        return new PartPoliticalInfo(title, announceTime, webSiteName,
                announceUser, content);
    }
}
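The two string-handling steps in `getSqcInfo` (splitting the page title and normalizing the post time) can be tried in isolation. The following is a minimal sketch using only the JDK; `PostFieldParsers` and its methods are hypothetical names, and the raw input `"发表于 2017-6-20 11:38:40"` is an assumed shape for the forum's date text, which the article does not show. Note that `splitTitle` deliberately differs from the sample code: it splits at the last hyphen and trims both parts, so a thread title that itself contains a hyphen is still handled.

```java
import java.util.Optional;

// Hypothetical helpers; not part of the article's sample code.
final class PostFieldParsers {

    // Splits "<thread title> - <site name>" at the LAST hyphen.
    // Returns {title, siteName}, or empty if there is no usable separator.
    static Optional<String[]> splitTitle(String pageTitle) {
        int idx = pageTitle.lastIndexOf('-');
        if (idx <= 0 || idx == pageTitle.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(new String[] {
            pageTitle.substring(0, idx).trim(),
            pageTitle.substring(idx + 1).trim()
        });
    }

    // Keeps the last two whitespace-separated tokens (date, time) and rewrites
    // the date from yyyy-M-d to yyyy/M/d. Returns empty if fewer than two tokens.
    static Optional<String> normalizeTime(String raw) {
        String[] parts = raw.trim().split("\\s+");
        if (parts.length < 2) {
            return Optional.empty();
        }
        String date = parts[parts.length - 2].replace('-', '/');
        String time = parts[parts.length - 1];
        return Optional.of(date + " " + time);
    }

    public static void main(String[] args) {
        // Assumed raw date text with a leading label.
        System.out.println(normalizeTime("发表于 2017-6-20 11:38:40").orElse("n/a"));
        // A title whose thread name contains a hyphen.
        String[] parts = splitTitle("A-B - Site").orElseThrow(IllegalStateException::new);
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

With these inputs the program prints `2017/6/20 11:38:40` and then `A-B | Site`. Returning `Optional` rather than indexing blindly makes the "unexpected format" case explicit, which matters because the forum's markup can change at any time.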

Entity class:

public class PartPoliticalInfo {

    private String title;        // title
    private String announceTime; // post time
    private String webSiteName;  // website name
    private String announceUser; // author
    private String content;      // content (post body)
    private String url;          // access link (not set by the constructor; use setUrl)

    public PartPoliticalInfo() {
        super();
    }

    public PartPoliticalInfo(String title, String announceTime,
            String webSiteName, String announceUser, String content) {
        super();
        this.title = title;
        this.announceTime = announceTime;
        this.webSiteName = webSiteName;
        this.announceUser = announceUser;
        this.content = content;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getAnnounceTime() {
        return announceTime;
    }

    public void setAnnounceTime(String announceTime) {
        this.announceTime = announceTime;
    }

    public String getWebSiteName() {
        return webSiteName;
    }

    public void setWebSiteName(String webSiteName) {
        this.webSiteName = webSiteName;
    }

    public String getAnnounceUser() {
        return announceUser;
    }

    public void setAnnounceUser(String announceUser) {
        this.announceUser = announceUser;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }
}

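Summary:

This example retrieves data from the web page at a specified URL.

Because it scrapes specific pages, the developer needs to be able to analyze each page's source code in order to locate the data.

Drawbacks:

Jsoup cannot see data that only appears after JavaScript has run, and HtmlUnit's JavaScript support is not very good either (though I may simply have been using it incorrectly).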