Python Web Scraping, a Hand-Holding Tutorial: If This Doesn't Teach You, I'll Switch Careers and Sell Ham - CSDN Blog

Article link: https://blog.youkuaiyun.com/LIN3456789/article/details/136743920

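This post writes a simple Python script that scrapes images from 彼岸图网 (pic.netbian.com) and saves them to the local disk. The key parts of the script are explained below: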

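Import the required libraries: requests sends HTTP requests, re handles regular-expression matching, and os provides operating-system functions such as creating folders.
Set the request headers (headers) so the request looks like it comes from a browser.
Define the function clean_filename, which strips characters that are not allowed in file names.
Loop over pages 10 through 19 of the site.
Build the URL for each page.
Send an HTTP GET request to fetch the page, then set the response encoding.
Use a regular expression to match the image links and image names in the page.
Create the folder that stores the images.
Iterate over the matched links and names, download each image, and save it locally.
Print a message after each successful download and after each page is finished.
Finally, print a message once all pages have been scraped.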
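The regex-matching and filename-cleaning steps above can be tried offline before sending any request. The following is a minimal sketch under stated assumptions: the fragment in `sample_html` is made up for illustration (it only imitates the `src="..." alt="..."` shape the pattern expects and is not copied from the site), and `extract_images` is a hypothetical helper name that does not appear in the script.

```python
import re

# Same pattern as the script: group 1 is a link starting with /u,
# group 2 is the alt text that the script uses as the image name.
IMG_PATTERN = re.compile(r'src="(/u.*?)".alt="(.*?)"')


def clean_filename(filename):
    """Remove characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/*?:"<>|]', '', filename)


def extract_images(html):
    """Return a list of (link, cleaned_name) pairs found in the HTML text."""
    return [(link, clean_filename(name)) for link, name in IMG_PATTERN.findall(html)]


# Made-up fragment for illustration only; the real page markup may differ.
sample_html = (
    '<li><a href="/tupian/1.html">'
    '<img src="/uploads/allimg/a/1.jpg" alt="sunset: sea/sky?" /></a></li>'
    '<li><a href="/tupian/2.html">'
    '<img src="/uploads/allimg/b/2.jpg" alt="mountain lake" /></a></li>'
)

pairs = extract_images(sample_html)
for link, name in pairs:
    print(link, "->", name)
```

Because `.*?` is non-greedy, each group stops at the first closing quote, so the two `<img>` tags produce two separate pairs instead of one long match.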
The complete code is as follows:

import requests
import re
import os

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}

def clean_filename(filename):
    # Use a regular expression to strip characters that are invalid in file names
    cleaned_filename = re.sub(r'[\\/*?:"<>|]', '', filename)
    return cleaned_filename

for page in range(10, 20):  # Scrape pages 10 through 19
    url = f"https://pic.netbian.com/index_{page}.html"
    response = requests.get(url=url, headers=headers)
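    response.encoding = response.apparent_encoding

    # .   matches any character except a newline (\n)
    # *   matches the preceding item zero or more times
    # ?   matches the preceding item zero or one time
    # .*? is a non-greedy (lazy) match
    parr = re.compile(r'src="(/u.*?)".alt="(.*?)"')  # Match the image link and the image name
    image = re.findall(parr, response.text)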

    path = "彼岸图网图片获取"
    os.makedirs(path, exist_ok=True)  # Create the folder if it does not already exist

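    # Iterate over the matched (link, name) pairs
    for i in image:
        link = i[0]  # Image link
        name = clean_filename(i[1])  # Cleaned image name
        # name = i[1]  # Raw image name
        # Request the image first, so a request that raises does not leave an empty file behind
        res = requests.get("https://pic.netbian.com" + link, headers=headers)
        with open(path + "/{}.jpg".format(name), "wb") as img:
            img.write(res.content)  # Write the response body to the .jpg file
        print(f"{name}.jpg downloaded successfully")
    print(f"Page {page} finished")
print("All pages finished")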
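The script saves every file as `{name}.jpg`. If two images share the same name, the later one overwrites the earlier one, and a name made up only of special characters becomes an empty string after cleaning. One possible way to handle both cases is sketched below, using only the standard library. `build_target_path` is a hypothetical helper that is not part of the original script, and its fallback rules (reuse the file name from the link, default to `.jpg`, append `_1`, `_2`, ...) are assumptions. It only tracks paths handed out during the current run and does not check files already on disk.

```python
import os
import re
from urllib.parse import urlsplit


def clean_filename(filename):
    """Remove characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/*?:"<>|]', '', filename).strip()


def build_target_path(folder, name, link, taken):
    """Build a unique local path for one image.

    folder: target directory
    name:   image name taken from the page (may be empty or duplicated)
    link:   image link such as /uploads/.../x.jpg
    taken:  set of paths already handed out in this run (updated in place)
    """
    link_base = os.path.basename(urlsplit(link).path)  # e.g. "x.jpg"
    stem, ext = os.path.splitext(link_base)
    base = clean_filename(name) or stem  # fall back to the file name in the link
    ext = ext or ".jpg"                  # assumption: treat extension-less links as JPEG
    candidate = os.path.join(folder, base + ext)
    counter = 1
    while candidate in taken:            # append _1, _2, ... on a name clash
        candidate = os.path.join(folder, f"{base}_{counter}{ext}")
        counter += 1
    taken.add(candidate)
    return candidate


taken_paths = set()
first = build_target_path("images", "lake", "/uploads/a/1.jpg", taken_paths)
second = build_target_path("images", "lake", "/uploads/a/2.png", taken_paths)
third = build_target_path("images", "lake", "/uploads/a/3.jpg", taken_paths)
fourth = build_target_path("images", "???", "/uploads/a/4.jpg", taken_paths)
print(first, second, third, fourth, sep="\n")
```

In the download loop, the returned path could replace `path + "/{}.jpg".format(name)` in the `open(...)` call, with one `taken` set shared across all pages.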