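This example uses the forum 宽带山. Given a keyword, the goal is to fetch every post that matches it, including each post's popularity (view count), title, reply count, author, post time, link, and full body text.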
The full code is as follows:

    import java.net.URLEncoder;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class KeyWordsSearchUtil {

        /**
         * Searches the forum for a keyword and returns one map per matching post.
         *
         * @param KeyWord the search keyword
         * @return a list of maps holding each post's popularity, title, reply count,
         *         author, post time, URL, and body text; empty if the request fails
         */
        public static List<Map<String, Object>> findByKeyWord(String KeyWord) {
            List<Map<String, Object>> postsList = new ArrayList<Map<String, Object>>();
            try {
                Document doc = Jsoup.connect("http://club.pchome.net/forum_1_15____md__1_"
                                + URLEncoder.encode(KeyWord, "utf-8") + ".html")
                        // The "query" and "auth" values match the jsoup documentation
                        // example and are probably not required by the site.
                        .data("query", "Java")
                        .userAgent("Mozilla")
                        .cookie("auth", "token")
                        .timeout(10000) // milliseconds
                        .ignoreHttpErrors(true)
                        .post();
                // Each result row is an li.i2; rows with class h-bg are skipped.
                Elements postsLs = doc.select("li.i2").not(".h-bg");
                for (Element childPost : postsLs) {
                    // Resolve the thread link once and reuse it for the URL and the body.
                    String postUrl = "http://club.pchome.net" + childPost.select("span.n3 a").attr("href");
                    Map<String, Object> postsOneMap = new HashMap<String, Object>();
                    postsOneMap.put("postsPopularity", childPost.select("li>span.n2").first().text());
                    postsOneMap.put("postsTitle", childPost.select("span.n3>a").attr("title"));
                    postsOneMap.put("postsFloor", childPost.select("span.n4").first().text());
                    postsOneMap.put("postsCname", childPost.select("a.bind_hover_card").first().text());
                    postsOneMap.put("postsCtime", childPost.select("li>span.n6").first().text());
                    postsOneMap.put("postsUrl", postUrl);
                    postsOneMap.put("postsContents", getContentsByUrl(postUrl));
                    postsList.add(postsOneMap);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            return postsList;
        }

        /**
         * Fetches the body text of a post.
         *
         * @param url the post's URL
         * @return the body text, or an empty string if it cannot be fetched
         */
        public static String getContentsByUrl(String url) {
            String contents = "";
            try {
                Document doc = Jsoup.connect(url)
                        .data("query", "Java")
                        .userAgent("Mozilla")
                        .cookie("auth", "token")
                        .timeout(10000)
                        .ignoreHttpErrors(true)
                        .post();
                // The body is the first div nested inside div.mc.
                Element contentsEle = doc.select("div.mc div").first();
                if (contentsEle != null) {
                    // Strip the image viewer's link text from the body.
                    contents = contentsEle.text().replace("[向左转][向右转][原图]", "");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            return contents;
        }

        public static void main(String[] args) throws Exception {
            List<Map<String, Object>> postsList = KeyWordsSearchUtil.findByKeyWord("电影");
            System.out.println("http://club.pchome.net/forum_1_15____md__1_"
                    + URLEncoder.encode("电影", "utf-8") + ".html");
            System.out.println(postsList.size() + "/////");
            for (int i = 0; i < postsList.size(); i++) {
                for (Map.Entry<String, Object> entry : postsList.get(i).entrySet()) {
                    System.out.println("key=" + entry.getKey() + "|value=" + entry.getValue());
                }
                System.out.println("-----------------");
            }
            // http://club.pchome.net/thread_1_15_7519679.html
            // String str = getContentsByUrl("http://club.pchome.net/thread_1_15_7519679.html");
            // System.out.println(str);
        }
    }

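The search URL in the listing is built by percent-encoding the keyword and placing it between a fixed prefix and `.html`. The sketch below isolates that step so it can be checked without touching the network. The class name `SearchUrlBuilder` and its `build` method are hypothetical names introduced here; only the prefix, the suffix, and the use of `URLEncoder` come from the original code. The keyword 电影 is written with Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original listing.
final class SearchUrlBuilder {
    // Prefix and suffix copied from the listing above.
    private static final String PREFIX = "http://club.pchome.net/forum_1_15____md__1_";
    private static final String SUFFIX = ".html";

    /** Returns the search URL for the given keyword, percent-encoded as UTF-8. */
    static String build(String keyword) {
        try {
            return PREFIX + URLEncoder.encode(keyword, "UTF-8") + SUFFIX;
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+7535 U+5F71 is the keyword used in the original main method.
        System.out.println(build("\u7535\u5f71"));
        // prints http://club.pchome.net/forum_1_15____md__1_%E7%94%B5%E5%BD%B1.html
    }
}
```

`URLEncoder` produces form encoding, so a space in the keyword becomes `+` rather than `%20`. Whether the forum accepts that inside a path is not something the original code shows, so multi-word keywords would need to be tested against the site.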
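The KeyWordsSearchUtil class above successfully fetches the list of posts on the 宽带山 forum that match the keyword 电影 ("movie"). The main method already contains a test, and it passes when the network is reachable. That said, this code only gets the feature working; its performance is poor, and it should be rewritten or optimized before being used in a real project.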
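One plausible contributor to the poor performance is that `findByKeyWord` calls `getContentsByUrl` once per post, one after another, and each call is a separate HTTP request with a timeout of up to 10 seconds. A minimal sketch of one possible optimization, assuming the post bodies can be fetched independently, is to run those requests on a small thread pool. The class `ParallelContentFetcher` and its `fetchAll` method are hypothetical names, and the fetcher is passed in as a function so the sketch runs without jsoup or network access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of the original listing.
final class ParallelContentFetcher {

    /**
     * Applies fetcher to every URL on a fixed-size thread pool.
     * Results are returned in the same order as the input URLs;
     * a URL whose fetch throws yields an empty string.
     */
    static List<String> fetchAll(List<String> urls, Function<String, String> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything first so the requests overlap.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            // Collect in submission order.
            List<String> contents = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    contents.add(future.get());
                } catch (ExecutionException e) {
                    contents.add(""); // one failed page should not abort the rest
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while fetching", e);
                }
            }
            return contents;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stub fetcher standing in for KeyWordsSearchUtil::getContentsByUrl.
        List<String> urls = Arrays.asList("u1", "u2", "u3");
        System.out.println(fetchAll(urls, u -> "body:" + u, 2));
        // prints [body:u1, body:u2, body:u3]
    }
}
```

In the scraper itself, the call might look like `fetchAll(postUrls, KeyWordsSearchUtil::getContentsByUrl, 4)`, with the results written back into each post's map afterwards. The pool size of 4 is an arbitrary illustration; a small value keeps the load on the forum modest.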