Batch-Scraping Web Page Images with Python and bs4 and Saving Them Locally

Wispertise

Article link: https://blog.youkuaiyun.com/Sunny_Deer/article/details/121088848

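This article uses bs4 to scrape images from a web page. Parsing with bs4 is simple: the logic is easy to follow and the code is easy to write, although some basic HTML knowledge is needed beforehand. The example scrapes wallpapers from a wallpaper site. (bs4 is a third-party library and must be installed before use.)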

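Steps

1. Fetch the source of the listing page and extract the link to each child page (the href attribute of each a tag).
2. Use each href to fetch the child page, then find the image's download address in it (the src attribute of the img tag).
3. Download the image.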
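
The first two steps can be tried offline before touching a real site. The following is a minimal sketch that runs the same href-then-src lookups on two hand-written HTML strings. The markup, the example.com URL, the /wallpapers/ paths, and the names LISTING_HTML and DETAIL_HTML are made up for illustration; they only imitate the structure this article relies on.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates a listing page and one child page.
LISTING_HTML = """
<div class="TypeList">
  <ul>
    <li><a href="/wallpapers/1.htm">Wallpaper 1</a></li>
    <li><a href="/wallpapers/2.htm">Wallpaper 2</a></li>
  </ul>
</div>
"""
DETAIL_HTML = """
<p align="center"><img src="https://example.com/images/wall_1.jpg"></p>
"""

# Step 1: collect the href of every link inside the listing container.
listing = BeautifulSoup(LISTING_HTML, "html.parser")
hrefs = [a.get("href") for a in listing.find("div", class_="TypeList").find_all("a")]

# Step 2: on a child page, read the src of the image inside the centered paragraph.
detail = BeautifulSoup(DETAIL_HTML, "html.parser")
src = detail.find("p", align="center").find("img").get("src")

print(hrefs)
print(src)
```

Step 3 is then only a matter of requesting src and writing the returned bytes to a file.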

First, import the required packages

import requests
from bs4 import BeautifulSoup

Prepare the URL and fetch the response

url = "https://umei.cc/bizhitupian/diannaobizhi/"
resp = requests.get(url)
resp.encoding = "UTF-8"  # set the encoding explicitly to avoid garbled text
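
A bare requests.get call has no timeout and does not complain about HTTP error codes. As a hedged variant, the sketch below wraps the request in two helpers. The names make_session and fetch_html and the User-Agent string are assumptions introduced here, not part of the original script; Session, timeout, and raise_for_status are standard requests features.

```python
import requests

# Hypothetical User-Agent string; replace it with something that identifies your script.
USER_AGENT = "bs4-image-demo/0.1"


def make_session(user_agent=USER_AGENT):
    """Create a Session so that headers and connections are reused across requests."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page, fail loudly on HTTP errors, and decode it as UTF-8."""
    # timeout is in seconds, so a stalled server cannot hang the script forever
    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    resp.encoding = "UTF-8"
    return resp.text


session = make_session()
# Usage (performs a real network request, so it is left commented out here):
# html = fetch_html(session, "https://umei.cc/bizhitupian/diannaobizhi/")
```

Reusing one session for the listing page, the child pages, and the images also avoids opening a new connection for every request.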

Extract the data with bs4

main_page = BeautifulSoup(resp.text, "html.parser")
alist = main_page.find("div", class_="TypeList").find_all("a")  # find the div with class "TypeList", then collect the a tags inside it
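
One caveat with chaining find(...).find_all(...): find returns None when nothing matches, so the chain raises AttributeError if the site changes its layout. A minimal alternative sketch, assuming the same div.TypeList structure, uses a CSS selector through select, which returns an empty list instead. The function name extract_links and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every link inside div.TypeList, or [] if the div is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns an empty list when nothing matches, so no None check is needed.
    # The [href] part skips <a> tags that have no href attribute.
    return [a["href"] for a in soup.select("div.TypeList a[href]")]


print(extract_links('<div class="TypeList"><a href="/a.htm">A</a><a>no href</a></div>'))
print(extract_links("<p>nothing here</p>"))
```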

Follow each link to its child page
Extract the image's download address
Download the image

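import os

os.makedirs("image", exist_ok=True)  # create the output folder if it does not exist yet
count = 1  # number of the image being processed
for a in alist:  # visit the child page behind each extracted a tag
    href = "https://umei.cc" + a.get("href")  # build the child-page URL by concatenation
    child_resp = requests.get(href)  # fetch the child page
    child_resp.encoding = "UTF-8"  # set the encoding to avoid garbled text
    child_mainpage = BeautifulSoup(child_resp.text, "html.parser")  # parse the child page
    child_alist = child_mainpage.find("p", align="center")  # find the p tag with align="center"
    image = child_alist.find("img")  # find the img tag inside it
    src = image.get("src")  # the src attribute holds the image's download address
    image_resp = requests.get(src)  # request the image itself
    image_name = src.split("/")[-1]  # name the file after the last segment of the download address
    with open("image/" + image_name, mode="wb") as f:  # write the response bytes to a file (change the path as needed); that file is the image
        f.write(image_resp.content)
    print(f"Image {count} done: {image_name}")
    count = count + 1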
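
The loop builds URLs by string concatenation and names files with split("/"). That works when every href starts with "/" and the image address has no query string. As a sketch of a more tolerant approach, the standard library's urllib.parse can do both jobs. The helper names absolute_url and filename_from_url and the sample addresses are assumptions for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://umei.cc/bizhitupian/diannaobizhi/"


def absolute_url(href, base=BASE_URL):
    """Resolve a relative, root-relative, or already absolute href against the listing URL."""
    return urljoin(base, href)


def filename_from_url(src):
    """Take the last path segment of a URL, ignoring any query string or fragment."""
    return posixpath.basename(urlsplit(src).path)


# Hypothetical hrefs of the three kinds a page might contain.
print(absolute_url("/bizhitupian/diannaobizhi/123.htm"))
print(absolute_url("123.htm"))
print(absolute_url("https://example.com/x.htm"))
print(filename_from_url("https://example.com/img/a.jpg?size=large"))
```

With plain split("/")[-1], the last example would yield "a.jpg?size=large", which is an awkward file name on most systems.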

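Improvements

Because the script sends many requests, the site may block your IP. To reduce that risk, let the program sleep for a while after each image is downloaded, and print the elapsed time as well.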

import time
time_start = time.time()  # timestamp at which the program starts
...
...
...
print(f"Elapsed: {time.time()-time_start}s")  # time elapsed when each image finishes
time.sleep(1)  # pause for 1 s
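
A possible refinement, offered as a sketch rather than part of the original script: measure durations with time.perf_counter, which is intended for timing intervals, and add a small random extra to the pause so requests are not spaced perfectly evenly. The helper name polite_pause and its default values are assumptions.

```python
import random
import time


def polite_pause(base_seconds=1.0, jitter=0.5):
    """Sleep for base_seconds plus a random extra of up to `jitter` seconds; return the delay used."""
    delay = base_seconds + random.uniform(0, jitter)
    time.sleep(delay)
    return delay


start = time.perf_counter()
for i in range(1, 4):
    # ... download image i here ...
    polite_pause(0.01, 0.01)  # tiny values so this demo finishes quickly
    print(f"Image {i} done, elapsed {time.perf_counter() - start:.2f}s")
```

In the real loop, calling polite_pause() with its defaults would wait between 1 and 1.5 seconds after each image.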

Full source code:



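import os
import time

import requests
from bs4 import BeautifulSoup

os.makedirs("image", exist_ok=True)  # create the output folder if it does not exist yet
count = 1  # number of the image being processed
url = "https://umei.cc/bizhitupian/diannaobizhi/"
resp = requests.get(url)
resp.encoding = "UTF-8"  # set the encoding explicitly to avoid garbled text
main_page = BeautifulSoup(resp.text, "html.parser")
alist = main_page.find("div", class_="TypeList").find_all("a")  # a tags inside the div with class "TypeList"
time_start = time.time()  # timestamp at which the downloads start
for a in alist:  # visit the child page behind each extracted a tag
    href = "https://umei.cc" + a.get("href")  # build the child-page URL by concatenation
    child_resp = requests.get(href)  # fetch the child page
    child_resp.encoding = "UTF-8"  # set the encoding to avoid garbled text
    child_mainpage = BeautifulSoup(child_resp.text, "html.parser")  # parse the child page
    child_alist = child_mainpage.find("p", align="center")  # find the p tag with align="center"
    image = child_alist.find("img")  # find the img tag inside it
    src = image.get("src")  # the src attribute holds the image's download address
    image_resp = requests.get(src)  # request the image itself
    image_name = src.split("/")[-1]  # name the file after the last segment of the download address
    with open("image/" + image_name, mode="wb") as f:  # write the response bytes to a file; that file is the image
        f.write(image_resp.content)
    print(f"Image {count} done: {image_name}")
    print(f"Elapsed: {time.time()-time_start}s")  # time elapsed when each image finishes
    count = count + 1
    time.sleep(1)  # pause for 1 s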
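
As written, the script stops at the first child page that fails, for example when a page has no matching p or img tag and find returns None. One way to keep going, sketched here under the assumption that the per-page work is moved into a function, is to catch errors per item and report them at the end. The names process_all and fake_download are hypothetical, and fake_download merely stands in for the real request, parse, and save steps.

```python
def process_all(items, handler):
    """Call handler(item) for each item; collect failures instead of stopping."""
    done, failed = [], []
    # enumerate(start=1) replaces the manual `count = count + 1` bookkeeping.
    for count, item in enumerate(items, start=1):
        try:
            handler(item)
        except Exception as exc:  # broad on purpose: this is a per-item safety net
            failed.append((item, repr(exc)))
            print(f"{count}: skipped {item!r} ({exc})")
        else:
            done.append(item)
            print(f"{count}: finished {item!r}")
    return done, failed


def fake_download(href):
    # Hypothetical stand-in for the real per-page work (request, parse, save).
    if "broken" in href:
        raise ValueError("no <img> found")


done, failed = process_all(["/a.htm", "/broken.htm", "/c.htm"], fake_download)
print(f"{len(done)} succeeded, {len(failed)} failed")
```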