I finally couldn't resist going after Bilibili. Here comes my crawler!!!

This article shows how to crawl cover images and keyframes from Bilibili's entertainment live-streaming section with the Python Scrapy framework, covering project setup, code walkthrough, request analysis, and the image download flow. The spider parses JSON responses to fetch several pages of data and uses a custom pipeline to store the images.


Python Crawler: Grab Cover Images and Keyframes from Bilibili's Entertainment Live Streams

I. Preparation

Create the Scrapy project

scrapy startproject BiliBili

cd BiliBili 

scrapy genspider bz "bilibili.com"

Create a launcher Python file
start.py

from scrapy import cmdline

# same as running `scrapy crawl bz` from the project directory
cmdline.execute("scrapy crawl bz".split())

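After these commands (and adding start.py next to scrapy.cfg so the crawl command resolves the project), the layout should look roughly like the standard Scrapy skeleton below; this is a sketch, not a verbatim listing:

BiliBili/
├── scrapy.cfg
├── start.py
└── BiliBili/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── bz.py
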
Modify and fill in the settings.py file
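
In short, the changes that matter (taken from the full settings.py in Part III) are the following. Note that Scrapy's ImagesPipeline also needs Pillow installed (pip install Pillow).

# Ignore robots.txt
ROBOTSTXT_OBEY = False

# Only show errors in the console
LOG_LEVEL = "ERROR"

# Send a browser-like User-Agent with every request
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}

# Enable the custom image pipeline defined in pipelines.py
ITEM_PIPELINES = {
   'BiliBili.pipelines.BilibiliPipeline': 300,
}

# Root folder for the downloaded images
IMAGES_STORE = "Download"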

Preparation done; on to the analysis.

II. Code and Page Analysis

1. Page analysis
Open Bilibili's entertainment live area. Since the number of streamers varies across its sub-sections, we will simply crawl the whole area at once.
Press F12 to open the developer tools and filter the Network panel by XHR.

Go through the requests one by one until you find the one whose response carries the room data (a getRoomList request).

Opening that URL directly in the browser returns the JSON data we need.

Copy the JSON into an online JSON formatter to pretty-print and validate it.
Analyze the JSON data: each entry under data.list holds the streamer name (uname), the cover image URL (user_cover), and the keyframe URL (system_cover).

Scroll down on the page and you will see the number of requests grow; keep scrolling and locate the second request that carries data.
Analyze the request URL: compared with the first request, only the page parameter has changed, so paging through the results just means incrementing page.

Analyze where the data sits: the response has the same structure as page one, with the room entries again under data.list.
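
Before wiring this into Scrapy, it can help to hit the endpoint once with plain requests and print the three fields we are after. This is just a disposable sketch; the URL and field names are the same ones the spider below uses, and only the page parameter changes between pages.

import requests

# Quick throwaway check of the endpoint found above (not part of the Scrapy project).
URL = ("https://api.live.bilibili.com/room/v3/area/getRoomList"
       "?platform=web&parent_area_id=1&cate_id=0&area_id=0"
       "&sort_type=sort_type_152&page={page}&page_size=30")

resp = requests.get(URL.format(page=1), headers={"User-Agent": "Mozilla/5.0"})
for room in resp.json()["data"]["list"]:
    # The three fields the spider will extract from each live room.
    print(room["uname"], room["user_cover"], room["system_cover"])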

2. Code analysis
bz.py

# -*- coding: utf-8 -*-
import scrapy
import json
from BiliBili.items import BilibiliItem

class BzSpider(scrapy.Spider):
    name = 'bz'
    allowed_domains = ['bilibili.com']
    # starting request URL (page 1 of the "all" entertainment area)
    start_urls = ['https://api.live.bilibili.com/room/v3/area/getRoomList?platform=web&parent_area_id=1&cate_id=0&area_id=0&sort_type=sort_type_152&page=1&page_size=30']
    num = 1

    def parse(self, response):
        # grab the list from the JSON response and loop over it
        data_list = json.loads(response.text)["data"]["list"]

        for data in data_list:
            uname = data['uname']                # streamer name
            user_cover = data["user_cover"]      # cover image URL
            system_cover = data["system_cover"]  # keyframe URL

            item = BilibiliItem(uname=uname, user_cover=user_cover, system_cover=system_cover)
            yield item

        # request further pages
        self.num += 1
        url = "https://api.live.bilibili.com/room/v3/area/getRoomList?platform=web&parent_area_id=1&cate_id=0&area_id=0&sort_type=sort_type_152&page=" + str(self.num) + "&page_size=30"

        # cap the number of pages requested
        if self.num <= 4:
            yield scrapy.Request(url=url, callback=self.parse)

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from BiliBili import settings
import os

class BilibiliPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        uname = item["uname"]
        img_cover = item["user_cover"]
        # request the cover image URL, carrying the streamer name in meta
        yield scrapy.Request(img_cover, meta={"uname": uname})

        img_crux = item['system_cover']
        # request the keyframe URL, again carrying the streamer name
        yield scrapy.Request(img_crux, meta={"uname": uname})

        # The image requests are therefore issued in this order:
        #   streamer 1 cover URL
        #   streamer 1 keyframe URL
        #   streamer 2 cover URL
        #   streamer 2 keyframe URL
        #   ...

    def file_path(self, request, response=None, info=None):
        # take the image name from the end of the URL
        file_name = request.url.split('/')[-1]

        # some image names look like xxx.jpg?xxxx, so strip any query string as well
        file_name = file_name.split("?")[0]

        # the streamer name passed along in meta
        category = request.meta['uname']
        # the storage root configured in settings.py
        images_store = settings.IMAGES_STORE

        # build the per-streamer folder path
        category_path = os.path.join(images_store, category)

        # Each image is saved into a folder named after its streamer under the
        # storage root. Given the request order above, we tell the cover apart
        # from the keyframe by whether that streamer's folder already exists.
        if not os.path.exists(category_path):
            # join the image name onto the streamer folder
            image_name = os.path.join(category, file_name)
            # cut out the bare file name and replace it with 封面图 (cover)
            name = image_name.split("\\")[1].split(".")[0]
            image_name = image_name.replace(name, "封面图")
            # return the cover image path
            return image_name
        else:
            image_name02 = os.path.join(category, file_name)
            # cut out the bare file name and replace it with 关键帧 (keyframe)
            name1 = image_name02.split("\\")[1].split(".")[0]
            image_name02 = image_name02.replace(name1, "关键帧")
            # return the keyframe path
            return image_name02
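
One caveat: image_name.split("\\")[1] relies on the Windows path separator that os.path.join produces there, so this exact file_path only works on Windows. As a sketch of the same idea for Linux/macOS (same folder-exists trick to tell the cover from the keyframe, just built with portable os.path helpers; this hypothetical PortableBilibiliPipeline is not the code used in this article):

import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline

from BiliBili import settings


class PortableBilibiliPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Same order as above: cover first, then keyframe, both tagged with the name.
        for url in (item["user_cover"], item["system_cover"]):
            yield scrapy.Request(url, meta={"uname": item["uname"]})

    def file_path(self, request, response=None, info=None):
        category = request.meta["uname"]
        # Keep the extension from the URL (dropping any ?query part).
        ext = os.path.splitext(request.url.split("?")[0])[1] or ".jpg"
        folder = os.path.join(settings.IMAGES_STORE, category)
        # The first image for a streamer becomes the cover, the second the keyframe,
        # relying on the same request order and folder-exists check as above.
        label = "封面图" if not os.path.exists(folder) else "关键帧"
        return os.path.join(category, label + ext)

To use it, point ITEM_PIPELINES at this class instead of BilibiliPipeline.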

The result of the crawl:
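
Given IMAGES_STORE = "Download" and the file_path logic above, the downloads should be arranged roughly like this (actual streamer names and file extensions depend on the rooms and image URLs):

Download/
├── 主播A/
│   ├── 封面图.jpg
│   └── 关键帧.jpg
├── 主播B/
│   ├── 封面图.jpg
│   └── 关键帧.jpg
└── ...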

III. Complete Code

bz.py

# -*- coding: utf-8 -*-
import scrapy
import json
from BiliBili.items import BilibiliItem


class BzSpider(scrapy.Spider):
    name = 'bz'
    allowed_domains = ['bilibili.com']
    start_urls = ['https://api.live.bilibili.com/room/v3/area/getRoomList?platform=web&parent_area_id=1&cate_id=0&area_id=0&sort_type=sort_type_152&page=1&page_size=30']
    num = 1

    def parse(self, response):
        # print(response)  # debug: verify each page response is fetched
        data_list = json.loads(response.text)["data"]["list"]

        for data in data_list:
            uname = data['uname']
            user_cover = data["user_cover"]
            system_cover = data["system_cover"]

            item = BilibiliItem(uname=uname,user_cover=user_cover,system_cover=system_cover)
            yield item
        self.num += 1
        url = "https://api.live.bilibili.com/room/v3/area/getRoomList?platform=web&parent_area_id=1&cate_id=0&area_id=0&sort_type=sort_type_152&page=" + str(self.num) + "&page_size=30"

        if self.num <= 4:
            yield scrapy.Request(url=url,callback=self.parse)


items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BilibiliItem(scrapy.Item):
    uname = scrapy.Field()          # streamer name
    user_cover = scrapy.Field()     # cover image URL
    system_cover = scrapy.Field()   # keyframe URL

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from BiliBili import settings
import os

class BilibiliPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        uname = item["uname"]
        img_cover = item["user_cover"]
        yield scrapy.Request(img_cover,meta={"uname":uname})

        img_crux = item['system_cover']
        yield scrapy.Request(img_crux,meta={"uname":uname})

    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        file_name = file_name.split("?")[0]
        category = request.meta['uname']
        images_store = settings.IMAGES_STORE
        category_path = os.path.join(images_store,category)
        # print(category_path)
        # print("="*20)
        if not os.path.exists(category_path):
            image_name = os.path.join(category, file_name)
            name = image_name.split("\\")[1].split(".")[0]
            image_name = image_name.replace(name,"封面图")
            return image_name
        else:
            image_name02 = os.path.join(category, file_name)
            name1 = image_name02.split("\\")[1].split(".")[0]
            image_name02 = image_name02.replace(name1, "关键帧")
            return image_name02

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for BiliBili project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'BiliBili'

SPIDER_MODULES = ['BiliBili.spiders']
NEWSPIDER_MODULE = 'BiliBili.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'BiliBili (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

LOG_LEVEL = "ERROR"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'BiliBili.middlewares.BilibiliSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'BiliBili.middlewares.BiliBili': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'BiliBili.pipelines.BilibiliPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

IMAGES_STORE = "Download"

That wraps up this project. If you found it useful, feel free to like, follow, and bookmark!

Follow the blog for more articles to come.
