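Scenario: the page markup is fairly simple. The goal is just to scrape the novel's text and save it to a local file, and I did not run into any anti-scraping measures.
The dependencies used are: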
<!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.11.3</version>
</dependency>
Page source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>第十一章 末代皇帝&最后一个克格勃(3)-龙族3·黑月之潮(中)</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="keywords" content="第十一章 末代皇帝&最后一个克格勃(3)-龙族3·黑月之潮(中)" />
<meta name="description" content="第十一章 末代皇帝&最后一个克格勃(3)-龙族3·黑月之潮(中)" />
<!--[if lt IE 9]>
<script src=/css3-mediaqueries.js></script>
<![endif]-->
<link rel="stylesheet" type="text/css" media="screen and (max-width: 900px)" href="/wap.css" />
<link rel="stylesheet" type="text/css" media="screen and (min-width: 900px)" href="/dcy.css" />
<link rel="alternate" type="application/rss+xml" href="http://www.********.cc/longzu3heiyuezhichaozhong/feed.asp?cmt=371" title="Comments Feed for 第十一章 末代皇帝&最后一个克格勃(3)" />
<script src="http://www.********.cc/longzu3heiyuezhichaozhong/script/common.js" type="text/javascript"></script>
<script src="http://www.********.cc/longzu3heiyuezhichaozhong/function/c_html_js_add.asp" type="text/javascript"></script>
</head>
<body><div class="v"><h1 align="center" class="STYLE1">龙族3·黑月之潮(中)</h1></div>
<div class="site clearfix"><span style="float:right;"> <a href="http://www.********.cc/longzu3heiyuezhichaozhong/" >返回首页</a></span><a href="http://www.********.cc/longzu3heiyuezhichaozhong/">龙族3·黑月之潮(中)</a> > 第十一章 末代皇帝&最后一个克格勃(3)</div>
<div class="chaptertitle clearfix">
<h1>第十一章 末代皇帝&最后一个克格勃(3)</h1>
</div>
<div id="p_adtop" class="clearfix">
<div id="p_ad_t1"><script language="javascript" type="text/javascript" src="/ad1.js"></script></div>
<div id="p_ad_t2"><script language="javascript" type="text/javascript" src="/ad1.js"></script></div>
<div id="p_ad_t4"></div>
</div>
<div class="bookcontent clearfix" id="BookText"> 御神刀斩落,带着大片的弧光。橘正宗血光飞溅,战栗着倒地。<br/><br/> 怀刃插在地上,橘正宗用来握刀的右手五指尽落,因此他没能把怀剑插进自己的肚子里。<br/><br/> 源稚生面无表情地收刀回鞘,从怀里抽出手帕沿着断指根部扎紧来止血。他的刀术极精,一刀斩断橘正宗的五指,却还留下短短的指根来止血。<br/><br/> <br/><br/> 1937年12月,南京被攻克,之后的六个星期中。城里有三十万平民被屠杀。南京城里西方桥民的证词是审判战犯的关键证据,一位法国天主教堂的修女说,日军甚至冲进西方教堂开设的育婴堂。强暴藏身在里面的中国女人。老嬷嬷让中国女人们穿上修女的衣服,秘密地带他们出城。他们在江边被日本军队拦截,藤原胜少校发现他们都是假修女,于是所有女人都遭到了强暴,反抗者被用刺刀刨开了肚子。没有遭到侵害的只有带队的那位老嬷嬷,但她目睹了那血腥残酷的一幕后无法忍受,于是开枪自杀。死前她诅咒说神会惩罚罪人,用雷电用火焰……”<br/><br/> 【THEEND】<br/><br/><div id="p_ad_t3"><script language="javascript" type="text/javascript" src="/xm.js"></script></div></div>
<!--content-->
<div id="p_ad_b1" class="clearfix">
</div>
<div class="bottomlink clearfix">
<div class="linkbtn clearfix"> <h2><a href="http://www.********.cc/longzu3heiyuezhichaozhong/370.html"><span>(快捷键:←)上一页</span></a> <a href="http://www.********.cc/longzu3heiyuezhichaozhong/"><span>返回章节目录(快捷键:回车)</span></a> <a href=""><span>下一页(快捷键:→)</span></a></h2> </div>
</div>
<div class="bottomlink clearfix">
<div style="display:none;" id="divAjaxComment"></div>
<div class="post" id="divCommentPost">
<p class="posttop"><a name="comment">发表评论:</a></p>
<form id="frmSumbit" target="_self" method="post" action="http://www.********.cc/longzu3heiyuezhichaozhong/cmd.asp?act=cmt&key=32c3ee99" >
<input type="hidden" name="inpId" id="inpId" value="371" />
<input type="hidden" name="inpArticle" id="inpArticle" value="" />
<input type="hidden" name="inpLocation" id="inpLocation" value="" />
<p><input type="text" name="inpName" id="inpName" class="text" value="" size="28" tabindex="1" /> <label for="inpName">名称(必填)</label></p>
<p><input type="text" name="inpEmail" id="inpEmail" class="text" value="" size="28" tabindex="2" /> <label for="inpEmail">邮箱(可以不填写)</label></p>
<!--<p><input type="text" name="inpHomePage" id="inpHomePage" class="text" value="" size="28" tabindex="3" /> <label for="inpHomePage">网站链接</label></p>-->
<p><label for="txaArticle">正文(留言最长字数:1000)</label></p>
<p>
<textarea name="txaArticle" id="txaArticle" onchange="GetActiveText(this.id);" onclick="GetActiveText(this.id);" onfocus="GetActiveText(this.id);" class="text" cols="50" rows="4" tabindex="5" style="width:80%;resize:none;" ></textarea>
</p>
<p><input name="btnSumbit" type="submit" tabindex="6" value="提交" onclick="JavaScript:return VerifyMessage()" class="button" /> <input type="checkbox" name="chkRemember" value="1" id="chkRemember" /> <label for="chkRemember">记住我,下次回复时不用重新输入个人信息</label></p>
<script language="JavaScript" type="text/javascript">objActive="txaArticle";ExportUbbFrame();</script>
</form>
<p class="postbottom">◎欢迎参与讨论,请在这里发表您的看法、交流您的观点。</p>
<script language="JavaScript" type="text/javascript">LoadRememberInfo();</script>
</div>
</div>
<div id="p_ad_b2" class="clearfix">
</div>
<!--页脚-->
<div class="footer clearfix"> <span class="page-comment">
</span> <span class="fright">
<div id="pagebottom">
</div>
</span> <span class="fleft gray-link"></script>Copyright 2015-2017 <a href="http://www.********.cc/longzu3heiyuezhichaozhong/">龙族3·黑月之潮(中)</a> all rights reserved <script language="javascript" type="text/javascript" src="//js.users.51.la/19241152.js"></script>
</span></div>
<div id="allbottom">
</div>
</body>
</html>
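In the page source above, the chapter body sits in the `BookText` container, paragraphs are separated by `<br/><br/>`, and an ad `<div>` shares the same container. As a rough illustration of that structure only, here is a dependency-free sketch that splits such a fragment into paragraphs. The class name `BookTextSplitSketch`, the method `splitParagraphs`, and the sample fragment are hypothetical, and regular expressions are used purely to keep the example self-contained; for real pages an HTML parser such as Jsoup is the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;

public class BookTextSplitSketch {

    // Splits the inner HTML of the chapter container into plain-text paragraphs.
    static List<String> splitParagraphs(String innerHtml) {
        List<String> paragraphs = new ArrayList<>();
        // Drop embedded <div>...</div> blocks (the ad container in the sample page)
        String withoutAds = innerHtml.replaceAll("(?is)<div\\b.*?</div>", "");
        // Split on <br>, <br/> and <br />, case-insensitively
        for (String piece : withoutAds.split("(?i)<br\\s*/?>")) {
            String text = piece.replaceAll("<[^>]+>", "") // strip any remaining tags
                    .replace('\u3000', ' ')               // full-width space
                    .replace('\u00A0', ' ')               // non-breaking space
                    .trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Hypothetical fragment shaped like the BookText container above
        String fragment = " First paragraph.<br/><br/> Second paragraph.<br/><br/> <br/><br/>"
                + "<div id=\"p_ad_t3\"><script src=\"/xm.js\"></script></div>";
        for (String paragraph : splitParagraphs(fragment)) {
            System.out.println(paragraph);
        }
    }
}
```

Empty pieces between consecutive `<br/>` tags are dropped, so each paragraph ends up on its own line without blank gaps.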
I won't show the actual site, so its domain is masked with asterisks. Here is the code:
import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.*;
/*
 Scrape a novel from a website
 */
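public class CaptureDemo {
    public static void main(String[] args) {
        // All chapters are appended to the same output file
        File file = new File("E:\\龙族3-黑月之潮(下).txt");
        for (int page = 345; page <= 360; page++) {
            String url = "http://www.********.cc/longzu3heiyuezhichaoxia/" + page + ".html";
            String bookContent = getBookContent(url);
            if (bookContent == null) {
                // getBookContent returns null when the request or parsing fails
                System.out.println(url + " failed, skipped.");
                continue;
            }
            System.out.println(bookContent);
            saveToLocal(bookContent, file);
            System.out.println(url + " is over.");
        }
    }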
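    // Save the data to a local file
    private static String saveToLocal(String bookContent, File file) {
        if (bookContent == null) {
            return "failed";
        }
        // Append if the file exists, create it otherwise. UTF-8 is written explicitly
        // instead of relying on the platform default encoding.
        try (Writer fw = new OutputStreamWriter(new FileOutputStream(file, true), "UTF-8")) {
            fw.write(bookContent);
            return "success";
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "failed";
    }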
    // Fetch the page and extract the target text
    private static String getBookContent(String url) {
        HttpGet httpGet = new HttpGet(url);
        // try-with-resources closes both the client and the response
        try (CloseableHttpClient closeableHttpClient = HttpClients.createDefault();
             CloseableHttpResponse closeableHttpResponse = closeableHttpClient.execute(httpGet)) {
            // Print the response status line
            System.out.println(closeableHttpResponse.getStatusLine());
            // Get the response entity
            HttpEntity entity = closeableHttpResponse.getEntity();
            if (entity == null) {
                return null;
            }
            // Read the page body as UTF-8 (the page declares charset=utf-8)
            String html = EntityUtils.toString(entity, "UTF-8");
            // Parse the HTML with Jsoup
            Document document = Jsoup.parse(html);
            // Chapter body
            Element bookText = document.getElementById("BookText");
            if (bookText == null) {
                return null;
            }
            // Chapter title
            Elements chaptertitle = document.getElementsByClass("chaptertitle");
            String headTitle = chaptertitle.text();
            // text() returns the paragraphs separated by spaces; turn those into line breaks
            String content = bookText.text().replaceAll(" ", "\n");
            return new StringBuilder("\n").append(headTitle).append("\n")
                    .append(content).append("\n\n").toString();
        } catch (IOException e) {
            // ClientProtocolException is a subclass of IOException
            e.printStackTrace();
        }
        return null;
    }
}
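Since every chapter is appended to one text file, it helps to fix the file encoding explicitly so the Chinese text reads the same on any machine. The following is a minimal alternative sketch using only the JDK's `java.nio.file.Files`; the class name `AppendUtf8Sketch`, the method `appendChapter`, and the temp-file usage are assumptions for illustration, not part of the original program.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendUtf8Sketch {

    // Appends one chapter to the target file as UTF-8, creating the file if needed.
    static void appendChapter(Path target, String chapterText) throws IOException {
        Files.write(target,
                chapterText.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,
                StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the real output path
        Path target = Files.createTempFile("novel-", ".txt");
        appendChapter(target, "Chapter 1\nfirst chapter body\n\n");
        appendChapter(target, "Chapter 2\nsecond chapter body\n\n");
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

Letting `appendChapter` throw `IOException` leaves it to the caller to decide whether a failed write should stop the whole crawl or only skip one chapter.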
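This is only a study note.
Summary: this post shows one way to scrape novel content from web pages in Java with the Apache HttpClient and Jsoup libraries: fetch the page, parse the HTML, extract the chapter text, and save it to a local file.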