Python crawler: downloading a web novel (static website)

Latest recommended article published on 2025-07-02 16:49:11

醉卧山林的执刀人


License: CC 4.0 BY-SA

Category: python    Tags: python

Article link: https://blog.youkuaiyun.com/qqqxiaobaiji/article/details/78752662


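This article covers the basic preparation for writing a Python crawler: setting up the Python environment, learning the basics, installing and using the common crawler modules, and a worked example of scraping web page data.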


Original article

http://blog.youkuaiyun.com/c406495762/article/details/78123502
GitHub: https://github.com/Jack-Cherish/python-spider

Here is my own understanding of it.

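1. Preparation
<1> Install Python, following the installation tutorial on Liao Xuefeng's website. This article uses Python 3.6.0.
https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000/0014316090478912dab2a3a9e8f4ed49d28854b292f85bb000
<2> Python basics: see Liao Xuefeng's Python 3 tutorial. You can also read it as you go while writing the crawler.
https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000
<3> Install the modules the crawler needs: requests and BeautifulSoup. The crawler only uses a few simple methods, introduced briefly below, so skim the documentation now and look things up when you need them.
Install the modules from the command prompt (cmd):
    pip3 install beautifulsoup4
    pip3 install requests
Documentation:
http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
http://beautifulsoup.readthedocs.io/zh_CN/latest/
<4> HTML and CSS basics and the browser's Inspect Element tool: the original article linked at the top covers enough to get started. For more detail, see w3school: http://www.w3school.com.cn/
<5> Development tool: pick an IDE or editor. PyCharm is recommended.
<6> Some notes on using Fiddler: http://blog.youkuaiyun.com/ohmygirl/article/details/17846199
<7> Fiddler may not capture HTTPS requests until its certificate is set up: https://www.cnblogs.com/zichi/p/4992885.html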
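<8> Methods used

Import the module and send a request:
    import requests
    r = requests.get('https://github.com/timeline.json')
This crawler is simple and only uses the get method: it sends a request to a URL, and we read what we need from r. A # starts a comment.
    r.text               # the response body as text, e.g. the fetched web page
    r.encoding = 'utf-8' # if the text comes back garbled, set the encoding the site really uses (utf-8, gb2312, ...) before reading r.text
    r.content            # the response body as raw bytes, e.g. a login CAPTCHA image or another non-text resource
    r.cookies            # the cookies currently stored for this response
    r.status_code        # the HTTP status code (200 OK, 404 Not Found, ...)
    r.url                # the URL that was requested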
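The attributes above can be combined into one small helper. This is only a sketch: the names `decode_response` and `fetch_html` and the 10-second timeout are my own choices, not part of the original article. When no encoding is passed it falls back to `apparent_encoding`, which is the encoding requests guesses from the response bytes and may be wrong for some pages.

```python
import requests


def decode_response(resp, encoding=None):
    """Return the body of a requests.Response as text.

    Raises requests.HTTPError for 4xx/5xx status codes. If `encoding` is
    not given, fall back to the encoding requests guesses from the bytes.
    """
    resp.raise_for_status()
    resp.encoding = encoding or resp.apparent_encoding
    return resp.text


def fetch_html(url, encoding=None, timeout=10):
    """GET a page and return its decoded text (hypothetical helper)."""
    resp = requests.get(url, timeout=timeout)
    return decode_response(resp, encoding)
```

For a site that serves GBK pages, a call would look like `fetch_html(url, encoding='gbk')`. Which encoding a given site uses has to be checked on the site itself.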
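<6.2> Beautiful Soup module
    Beautiful Soup is a Python library for extracting data from HTML or XML files.
    from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 module
    Python variables do not need a declared type.
    ret = requests.get('http://www.biqukan.com/1_1094/')
    html = ret.text
    bf = BeautifulSoup(html,'html.parser')
    Parsing the HTML with BeautifulSoup gives a BeautifulSoup object, which can also print the document in a standard indented layout (bf.prettify()).
    texts = bf.find_all('div',class_='listmain')
    This finds every div whose class is listmain.
    a_bf = BeautifulSoup(str(texts[0]),'html.parser')
    a_texts = a_bf.find_all('a')
    This finds every a tag inside the first listmain div.
    a_texts = a_texts[0].text.replace('\xa0'*8,'\n\n')
    This is a replacement: each run of eight non-breaking spaces ('\xa0') becomes a blank line ('\n\n').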
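To try these calls without touching the network, the same steps can be run against an inline HTML string. The markup in `SAMPLE_HTML` is invented for illustration and only mimics the `listmain` structure described above. The function names are hypothetical too.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the real page layout may differ.
SAMPLE_HTML = """
<div class="listmain">
  <dl>
    <dd><a href="/1_1094/1.html">Chapter 1</a></dd>
    <dd><a href="/1_1094/2.html">Chapter 2</a></dd>
  </dl>
</div>
"""


def extract_links(html):
    """Return (title, href) pairs for every <a> inside div.listmain."""
    soup = BeautifulSoup(html, 'html.parser')
    listmain = soup.find('div', class_='listmain')
    if listmain is None:
        return []  # the page has no chapter list
    return [(a.get_text(strip=True), a.get('href'))
            for a in listmain.find_all('a')]


def clean_chapter_text(text):
    """Turn each run of eight non-breaking spaces into a blank line."""
    return text.replace('\xa0' * 8, '\n\n')


print(extract_links(SAMPLE_HTML))
```

Calling `find_all` directly on the tag returned by `find` avoids re-parsing `str(texts[0])`. Both approaches select the same `a` tags.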
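<6.3> Python basics
a_texts[15:]  drops the first 15 items
len()  returns the number of items
for loop
The loop body must be indented, by convention four spaces:
    total = 0  # named total so it does not shadow the built-in sum()
    for x in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
        total = total + x
        print(total)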
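Slicing, len() and a for loop fit together as in the short sketch below. The list contents and the helper name `skip_leading` are made up for the example.

```python
def skip_leading(items, count):
    """Return a new list without the first `count` entries."""
    return items[count:]


# Made-up link names: two unwanted entries followed by three chapters.
links = ['latest-3', 'latest-2', 'ch-1', 'ch-2', 'ch-3']
chapters = skip_leading(links, 2)

for index, name in enumerate(chapters, start=1):
    print(index, name)
print(len(chapters))
```

A slice never raises an error when the list is shorter than the offset. It simply returns an empty list.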
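Reading a file: f = open('/Users/michael/test.txt', 'r'), where the mode 'r' means read.
    with open('/Users/michael/test.txt', 'r') as f:  # f.close() is called automatically when the block ends
        print(f.read())
Writing a file: open it in 'w' (overwrite) or 'a' (append) mode instead of 'r', then call:
    f.write(each.string+'\n')
    f.writelines(download_text)
    f.write('\n\n')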
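A minimal sketch of the append pattern, using a temporary directory so it can be run anywhere. The helper name `append_chapter` and the file name `novel.txt` are assumptions for the example.

```python
import os
import tempfile


def append_chapter(path, title, body):
    """Append one chapter (title line, body, blank line) to a text file."""
    # Mode 'a' creates the file if needed and always writes at the end.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(body)
        f.write('\n\n')


path = os.path.join(tempfile.mkdtemp(), 'novel.txt')
append_chapter(path, 'Chapter 1', 'first body')
append_chapter(path, 'Chapter 2', 'second body')

with open(path, 'r', encoding='utf-8') as f:
    content = f.read()
print(content)
```

Because the file is opened in 'a' mode, running a download twice appends the same chapters twice. Delete the output file first, or use 'w' when opening it once for the whole run.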

Code

import requests
from bs4 import BeautifulSoup

server = 'http://www.biqukan.com/'
# Fetch the table-of-contents page of the novel
ret = requests.get('http://www.biqukan.com/1_1094/')
html = ret.text
bf = BeautifulSoup(html,'html.parser')
# The chapter links are inside <div class="listmain">
texts = bf.find_all('div',class_='listmain')
a_bf = BeautifulSoup(str(texts[0]),'html.parser')
a_texts = a_bf.find_all('a')
# Skip the first 15 links
for each in a_texts[15:]:
    chapter_url = server+each.get('href')
    print(each.string,chapter_url)
    download_req = requests.get(chapter_url)
    download_html = download_req.text
    download_bf = BeautifulSoup(download_html,'html.parser')
    # The chapter body is inside <div class="showtxt">
    download_text = download_bf.find_all('div',class_='showtxt')
    # Turn each run of eight non-breaking spaces into a paragraph break
    download_text = download_text[0].text.replace('\xa0'*8,'\n\n')
    # Append the chapter title and body to the output file
    with open('一念永恒.txt','a',encoding='utf-8') as f:
        f.write(each.string+'\n')
        f.write(download_text)
        f.write('\n\n')
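The script above can also be wrapped in functions with a few safety nets. The following is a sketch under assumptions, not the original author's code: the function names, the `fetcher` parameter, the timeout and the one-second delay are all my own. It joins URLs with `urljoin` instead of string concatenation, raises on HTTP errors, and pauses between requests.

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch(url, timeout=10):
    """Download a page and return its text, raising on HTTP errors."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.text


def download_novel(index_url, out_path, skip=15, delay=1.0, fetcher=fetch):
    """Save every chapter linked from index_url; return how many were written."""
    soup = BeautifulSoup(fetcher(index_url), 'html.parser')
    listmain = soup.find('div', class_='listmain')
    if listmain is None:
        raise ValueError('no <div class="listmain"> in the index page')
    written = 0
    # Mode 'w' starts a fresh file, so a re-run does not duplicate chapters.
    with open(out_path, 'w', encoding='utf-8') as f:
        for link in listmain.find_all('a')[skip:]:
            chapter_url = urljoin(index_url, link.get('href'))
            page = BeautifulSoup(fetcher(chapter_url), 'html.parser')
            body = page.find('div', class_='showtxt')
            if body is None:
                continue  # page lacks the expected container; skip it
            f.write(link.get_text(strip=True) + '\n')
            f.write(body.text.replace('\xa0' * 8, '\n\n'))
            f.write('\n\n')
            written += 1
            time.sleep(delay)  # be polite to the server
    return written
```

A call matching the script above would be `download_novel('http://www.biqukan.com/1_1094/', '一念永恒.txt')`. The `fetcher` parameter exists so the function can be tried with canned HTML instead of live requests.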