利用爬虫采集数据

最新推荐文章于 2025-02-10 09:44:53 发布

Csxyzj

最新推荐文章于 2025-02-10 09:44:53 发布

阅读量913

点赞数 1

文章标签：爬虫

本文链接：https://blog.youkuaiyun.com/Csxyzj/article/details/143993631

版权

结合第三方模块requests，文件IO、正则表达式，通过函数封装爬虫应用采集数据
优快云中对代码进行总结：
1. 需求分析：如何确定采集的URL地址和数据的
2. 代码实现：描述包含详细注释的代码
3. 结果呈现：截图展示采集数据

以采集网络小说为例，首先写一个采集目录的爬虫

import requests
import json
#目录网站
url = "https://www.xzmncy.com/list/53018/"
#伪造代理
headers = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"
}
#发送伪造请求
response = requests.get(url,headers=headers)
#设置响应编码
response.encoding = "UTF-8"
#查看响应数据
content = response.text
#print(content)
#正则提取章节名和目录
import re
p = r'<a href="(.*?)"\s+title="(.*?)">'
#全部匹配的方式提取数据
chs = re.findall(p,content,re.DOTALL)
chs.remove(('javascript:void(0)" onClick="sethome()">将本站设为首页</a></div>\n\t\t\t<div class="addfavorite"><a href="javascript:void(0)" onClick="addFavorite()">收藏笔趣阁</a></div>\n\t\t\t<script>login_page();</script>\n\t\t</div>\n\t</div>\n\t<div class="header">\n\t\t<div class="logo"><a href="https://www.xzmncy.com/">笔趣阁</a></div>\n\t\t<script>search_page();</script>\n\t</div>\n\t<div class="nav">\n\t\t<ul>\n\t\t\t<li><a href="https://www.xzmncy.com/">首页</a></li>\n\t\t\t<li><a rel="nofollow" href="/user/bookcase.html">我的书架</a></li>\n\t\t\t<li><a href="/xuanhuan/">玄幻小说</a></li>\n\t\t\t<li><a href="/wuxia/">武侠小说</a></li>\n\t\t\t<li><a href="/dushi/">都市小说</a></li>\n\t\t\t<li><a href="/lishi/">历史小说</a></li>\n\t\t\t<li><a href="/kehuan/">科幻小说</a></li>\n\t\t\t<li><a href="/wangyou/">网游小说</a></li>\n\t\t\t<li><a href="/yanqing/">言情小说</a></li>\n\t\t\t<li><a href="/paihang/allvisit/1/">排行榜单</a></li>\n\t\t\t<li><a href="/quanben/allvisit/1/">全本小说</a></li>\n\t\t\t<li><a rel="nofollow" href="/user/readlog.html">阅读记录</a></li>\n\t\t</ul>\n\t</div>\n</div>\n<div class="banner"><script>list1();</script></div>\n\n<div id="info">\n    <div class="position"><a href="https://www.xzmncy.com/">笔趣阁</a> &gt; <a href="/wuxia/">武侠小说</a> &gt; <a href="/list/53018/">幽冥古神</a></div>\n    <div class="info">\n        <div class="infobar">\n            <h1>幽冥古神</h1>\n            <p>作&nbsp;&nbsp;者：莫问无心</p>\n            <p>动&nbsp;&nbsp;作：<a href="Javascript:void(0);" onclick="book.addBook(\'53018\');">加入书架</a>, <a href="Javascript:void(0);" onclick="book.userVote(\'53018\');">投推荐票</a>, <a href="#footer">直达底部</a></p>\n            <p>类&nbsp;&nbsp;别：<a href="/wuxia/">武侠小说</a></p>\n            <p>最后更新：2024-11-20 09:24</p>\n            <p>最新章节：<a href="/list/53018/26338294.html">第六百二十三章 偷学</a></p>\n        </div>\n        <div class="intro">\n            <p>上古大陆，一位拥有先天全体的少年，因为一次意外，引得黑珠入体，从而导致元力全失，至此他失去了所有光彩，族人的陷害让他认识了师尊，在一个分身的教导下，少年披荆斩棘，过五关斩六将，一步步成为真正的强者......为了家族命运，少年踏上了独自修炼的征程；为了亲人，他被迫选择了自己不爱的人；为了爱人，他忍受了无数年的自责。从一个小小的战士，逐步成长为天元大陆至高无上的古神，而最终，为了整个天元大陆，他却付出了所有......</p>\n            <p>本站提示：各位书友要是觉得《<a href="/list/53018/">幽冥古神</a>》还不错的话请不要忘记向您QQ群和微博里的朋友推荐哦！</p>\n        </div>\n    </div>\n\n    <div class="sidebar">\n        <div class="pic">\n            <img width="120" height="150" src="https://www.xzmncy.com/img/53/53018.jpg" onerror="this.src=\'/images/nocover.jpg\'" alt="幽冥古神"/>\n            <span class="b"></span>                    </div>\n    </div>\n</div>\n\n<div class="banner"><script type="text/javascript">list2();</script></div>\n\n<div id="list">\n    <dl>\n        <dt>《幽冥古神》章节列表</dt>\n                <dd><a href="/list/53018/24932622.html', '第一章 乌海镇'))
chs = [('/list/53018/24932622.html', '第一章 乌海镇')] + chs
print(chs)
chapter = dict()
for ch in chs :
    chapter[ch[1]] = "https://www.xzmncy.com/" + ch[0]
print(chapter)
#文件IO中保存数据
with open("chapter.txt",mode="wt",encoding="UTF-8") as file:
    json.dump(chapter,file)

然后再写一个采集章节的爬虫

import random

import requests
import re
import time
import json
#加载需要的目录
with open("chapter.txt",encoding="UTF-8") as file :
    chs = json.load(file)
#print(chs)
#循环遍历，发起伪造请求
headers = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"
}
for title,url in chs.items() :
    print(f"准备采集：{title}")
    #发起伪造请求
    response = requests.get(url,headers=headers)
    #设置编码
    response.encoding = "UTF-8"
    #分析数据格式
    html = response.text
    #print(html)
    print("------------------------------------")
    #定义正则，采集数据
    p = r'<div id="htmlContent">(.*?)</div>'
    html = re.search(p,html,re.DOTALL)
    #数据筛选
    html = html.group(1).strip()
    #数据清洗
    p2 = r'<br>(.*?)<br>'
    html = re.findall(p2,html,re.DOTALL)
    #print(html)
    html = "\n".join(html)
    #print(html)
    with open("幽冥古神.txt",mode="at",encoding="UTF-8") as file :
        #保存到文件
        file.write("\n\n--------------------------------------\n\n")
        #标题
        file.write("\n\n" + title + "\n\n")
        #内容
        file.write(html)

    #模拟用户请求，每次请求完成休眠3~5秒
    time.sleep(random.randint(a=3,b=5))
    print(f"{title}章节采集完成")
    #测试，采集一次数据
    #break

采集结果如下