23hh Novel Site: Crawler 0.1 (Python)

Latest recommended article published on 2025-06-12 09:07:10

weixin_30651273

License: CC 4.0 BY-SA

Tags: python, crawler

Original link: http://www.cnblogs.com/SilenceCity/p/3640398.html

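This article presents a simple Python crawler that fetches chapter content from a specific novel site and converts it into readable text. The crawler reads the page's declared encoding, extracts the chapter body, strips HTML markup, and includes a basic next-page lookup.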

This is the first version: it simply fetches the page you are currently reading on the novel site.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# --------------------------------------------
#     Program:  "Read a Novel" crawler
#     Version:  0.1
#     Author:   Silence
#     Date:     2014-03-30
#     Usage:    enter quit to exit
#     Function: fetch the specified novel content from the specified site and display it
#     Changes:  0.2 accept a table-of-contents page, fetch every chapter listed on it, and save them as txt files
# ---------------------------------------------

import re
import urllib2

from urlparse import urljoin


class Novel_Tool:

    def __init__(self, weburl):
        self.url = weburl
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
        }
        self.newLineChar = ''

    # Get the encoding the page declares; some novel sites still serve gbk.
    # Some sites declare a charset that does not match the bytes they actually
    # send; that case is ignored for now.
    def getPageType(self, content):
        # Handles both charset=gbk" and charset="gbk"
        pattern = re.compile(r'charset=["\']?([\w-]+)', re.I)
        match = pattern.search(content)
        # Assume utf-8 when the page declares no charset
        return match.group(1) if match else 'utf-8'

    def start(self):
        # Fetch the page and convert it to utf-8 encoded text
        req = urllib2.Request(
            url = self.url,
            headers = self.headers
        )
        myResponse = urllib2.urlopen(req).read()
        # print myResponse
        decodeResp = myResponse.decode(self.getPageType(myResponse)).encode('utf-8')

        # re.S lets . match newlines, so the chapter body can span lines
        contentPattern = re.compile(r'<dd id="contents">(.*?)</dd>', re.S)

        match = contentPattern.search(decodeResp)
        if match is None:
            print 'Chapter text not found on this page'
            return
        content = self.replaceWebTag(match.group(1))
        print content
        # print self.getNextPage(decodeResp)

    # Get the URL of the next page, or None if it cannot be found
    def getNextPage(self, content):
        # Locate the footer link bar first
        footlinkRex = re.compile(r'footlink">(.*?)</dd>', re.S)
        footMatch = footlinkRex.search(content)
        if footMatch is None:
            return None
        # The "下一页" (next page) link follows the "返回目录" (back to contents) link
        pattern = re.compile(r'返回目录.*?<a\s+href="([^"]+)"[^>]*>下一页', re.S)
        m = pattern.search(footMatch.group(1))
        if m is None:
            return None
        # Resolve the relative href against the current page URL
        return urljoin(self.url, m.group(1))

    def replaceWebTag(self, content):
        # Drop &nbsp; entities and turn <br>, <br/> and <br /> into newlines
        charToNoneRex = re.compile(r'&nbsp;')
        charToNewLineRex = re.compile(r'<br\s*/?>', re.I)

        content = charToNoneRex.sub("", content)
        content = charToNewLineRex.sub("\n", content)
        return content


if __name__ == '__main__':
    print u"""
# --------------------------------------------
#     Program:  "Read a Novel" crawler
#     Version:  0.1
#     Author:   Silence
#     Date:     2014-03-30
#     Usage:    enter quit to exit
#     Function: fetch the specified novel content from the specified site
# ---------------------------------------------"""

    myinput = raw_input('Enter the novel page to start crawling from (defaults to the first chapter page)\n')
    if myinput == '':
        myinput = 'http://www.23hh.com/book/43/43957/15468368.html'
        # myinput = 'http://www.23hh.com/book/43/43957/12229083.html'
    nodel = Novel_Tool(myinput)
    nodel.start()
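
The listing above targets Python 2 (`urllib2`, `raw_input`, the `print` statement). For readers on Python 3, the three parsing steps (reading the declared charset, extracting the chapter body, and finding the next-page link) might be sketched roughly as follows. This is only an illustration, not the author's code: the function names `detect_charset`, `extract_chapter` and `next_page_url` are made up for this sketch, and the sample markup is an assumption modeled on the patterns the original regular expressions look for, not a copy of a real 23hh.com page. No network access is involved.

```python
import re
from urllib.parse import urljoin

# Bytes pattern: the charset has to be read before the page can be decoded.
CHARSET_RE = re.compile(rb'charset=["\']?([\w-]+)', re.I)
CONTENT_RE = re.compile(r'<dd id="contents">(.*?)</dd>', re.S)
NEXT_RE = re.compile(r'返回目录.*?<a\s+href="([^"]+)"[^>]*>下一页', re.S)


def detect_charset(raw, default="utf-8"):
    """Return the charset declared in the raw page bytes, or `default`."""
    match = CHARSET_RE.search(raw)
    return match.group(1).decode("ascii") if match else default


def extract_chapter(html):
    """Return the text inside <dd id="contents">, or None if it is absent."""
    match = CONTENT_RE.search(html)
    if match is None:
        return None
    text = match.group(1).replace("&nbsp;", "")
    # Turn <br>, <br/> and <br /> into newlines
    return re.sub(r"<br\s*/?>", "\n", text, flags=re.I)


def next_page_url(html, page_url):
    """Return the absolute URL of the next-page link, or None."""
    match = NEXT_RE.search(html)
    return urljoin(page_url, match.group(1)) if match else None


# Hypothetical sample page (assumed markup, encoded as gbk like the real site)
sample_html = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=gbk"></head><body>'
    '<dd id="contents">&nbsp;&nbsp;第一段<br />第二段<br>第三段</dd>'
    '<dd id="footlink"><a href="index.html">返回目录</a> '
    '<a href="15468369.html">下一页</a></dd>'
    '</body></html>'
)
raw = sample_html.encode("gbk")
html = raw.decode(detect_charset(raw))

if __name__ == "__main__":
    print(extract_chapter(html))
    print(next_page_url(html, "http://www.23hh.com/book/43/43957/15468368.html"))
```

Using `urljoin` rather than slicing the URL at the last `/` also copes with absolute or root-relative links, should the site ever emit them.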