The first script is getlink.py; its job is to collect the links to the individual chapters. Open 当时明月's blog and click "我的所有文章" (all my articles): everything the author has published is listed page by page, and the chapters of this novel are exactly the entries whose titles match .?长篇.?明朝的那些事儿-历史应该可以写得好看\[(\d*)-(\d*)\]. All getlink.py has to do is save these chapters and their links into links.dat for later use.
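As a quick sanity check of that title pattern, here is a minimal sketch; the sample title below is invented to match the format described above:

# -*- coding: utf-8 -*-
import re

title = u'[长篇]明朝的那些事儿-历史应该可以写得好看[0001-0010]'    # hypothetical sample title
match = re.search(ur'.?长篇.?明朝的那些事儿-历史应该可以写得好看\[(\d*)-(\d*)\]', title)
if match:
    print match.group(1), match.group(2)    # prints: 0001 0010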
getlink.py
# -*- coding: utf-8 -*-
import urllib, re, os, sys, pickle
from xml.dom.minidom import parse
import xml.dom.minidom

uid = '1233526741'    # 当时明月's user ID

# collect the chapters and their links
chapters = {}    # maps chapter range to link, e.g. chapters['0001-0010'] = '/u/49861fd5010003ii'
for i in range(1, 100):    # walk the paginated article list
    filehandle = urllib.urlopen('http://blog.sina.com.cn/sns/service.php?m=aList&uid=%s&sort_id=0&page=%d' % (uid, i))
    myDoc = parse(filehandle)
    myRss = myDoc.getElementsByTagName("rss")[0]
    items = myRss.getElementsByTagName("item")
    for item in items:
        title = item.getElementsByTagName("title")[0].childNodes[0].data
        link = item.getElementsByTagName("link")[0].childNodes[0].data
        match = re.search(ur'.?长篇.?明朝的那些事儿-历史应该可以写得好看\[(\d*)-(\d*)\]', title)
        if match:
            # print match.group(1), ":", match.group(2)
            chapters['%04d-%04d' % (int(match.group(1)), int(match.group(2)))] = link

# save chapters to a file for later use
output = open('links.dat', 'wb+')
pickle.dump(chapters, output)
output.close()
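Once getlink.py has run, the saved mapping can be checked with a few lines (a sketch; it assumes links.dat sits in the current directory):

import pickle

links = open('links.dat', 'rb')
chapters = pickle.load(links)
links.close()
for chapter in sorted(chapters):
    print chapter, '->', chapters[chapter]    # e.g. 0001-0010 -> /u/49861fd5010003ii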

With links.dat in hand, bookdownload.py downloads the content of every chapter and assembles it into the final full-text file mingthings.txt. The key to this script is line 19, which extracts the actual body of each post from the downloaded page (discarding the ads, scripts, and so on). Inspecting the HTML source of the posts shows that everything we want lives inside <div id="articleTextxxxxxxxxxxxxxxxx">...</div>, where xxxxxxxxxxxxxxxx is the link ID of the post, so a single regular expression is enough to pull out the real content of each chapter.
bookdownload.py
 1  # -*- coding: utf-8 -*-
 2  import urllib, re, os, sys, pickle
 3  from xml.dom.minidom import parse
 4  import xml.dom.minidom
 5
 6  uid = '1233526741'    # 当时明月's user ID
 7
 8  # read the chapters and their links
 9  chapters = {}    # maps chapter range to link, e.g. chapters['0001-0010'] = '/u/49861fd5010003ii'
10  links = open('links.dat', 'rb+')    # links.dat is generated by getlink.py
11  chapters = pickle.load(links)    # load the chapter/link mapping from links.dat
12
13  book = open('mingthings.txt', 'w+')    # mingthings.txt is the full text to be generated
14  for chapter in sorted(chapters):
15      print chapter    # show the chapter currently being processed
16      webpage = urllib.urlopen('http://blog.sina.com.cn' + chapters[chapter]).read().decode('utf-8')
17
18      # s: dot matches newline; i: case-insensitive; m: ^ and $ match at line breaks
19      match = re.search(ur'(?siu).*<div id="articleText' + chapters[chapter][3:] + '".*?>(.*?)</div>.*', webpage)
20      if match:
21          text = match.group(1)    # the body of this chapter
22
23          # clean up the chapter body
24          text = re.sub(ur'(?su)<(.*?)>', '', text)    # strip HTML tags
25          text = re.sub(ur'(?su)( )+', ' ', text)    # collapse runs of spaces
26          text = re.sub(ur"(?um)^( +)", "", text)    # strip leading spaces
27          text = re.sub(ur'(?um)^(\s+)', '', text)    # strip remaining leading whitespace
28          text = re.sub(ur'(?siu)(.?长篇.?明朝的那些事儿-历史应该可以写得好看\[\d*-\d*\])', ur'\r\n\1\r\n', text)    # set the chapter heading off on its own line
29          text = re.sub(ur'(?um)^(.*)$', ur'\1', text)
30
31          book.write(text.encode('gbk', 'ignore') + "\r\n\r\n")
32          book.flush()
33
34  book.close()
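To see what line 19 and the cleanup passes actually do, here is a minimal sketch run against a made-up page fragment; the link ID and HTML below are invented for illustration:

# -*- coding: utf-8 -*-
import re

link = '/u/49861fd5010003ii'    # hypothetical chapter link, as stored in chapters
webpage = (u'<html><body><script>ad();</script>'
           u'<div id="articleText49861fd5010003ii" class="body">'
           u'<p>  正文第一段</p><p>正文第二段</p></div>'
           u'<div id="footer">junk</div></body></html>')

match = re.search(ur'(?siu).*<div id="articleText' + link[3:] + '".*?>(.*?)</div>.*', webpage)
if match:
    text = match.group(1)                        # <p>  正文第一段</p><p>正文第二段</p>
    text = re.sub(ur'(?su)<(.*?)>', '', text)    # strip the tags
    text = re.sub(ur'(?um)^( +)', '', text)      # strip leading spaces
    print text                                   # 正文第一段正文第二段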

Run the two scripts above once a day, one after the other, and the latest full text of 《明朝的那些事儿-历史应该可以写得好看》 will be on your hard disk.
In short, regular expressions are a powerful tool for downloading long, chaptered web novels.