Scraping Douyu with Scrapy

This post shows how to use the Scrapy framework to crawl streamer information from the Douyu live-streaming platform: streamer name, room number, room title, and room URL. The crawl starts from the Douyu directory page, extracts the cate2Id values from a JavaScript variable embedded in the page, splices them into category API URLs, and then parses the JSON responses to pull out the required fields.


Overall workflow:

Crawl the streamer name, room number, room title, and room URL for every streamer on the site.

1. Open https://www.douyu.com/directory.

2. Parse the JavaScript variable embedded in the page to get each category's cate2Id, and splice it into the category API URL: https://www.douyu.com/gapi/rkc/directory/2_ + cate2Id (a sample of the embedded category JSON is shown after this list).

3. Crawl those URLs; each response is JSON, which is parsed to extract the fields above.
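For reference, the category objects embedded in the directory page's JavaScript look roughly like the fragment below; the spider matches them with a regex and reads cate2Id out of each one. A quick standalone check (the field values here are invented for illustration):

```python
import re
import json

# A fragment shaped like the objects embedded in the directory page's
# JavaScript; the values are made up for illustration.
page_text = '{"cate2Name":"SomeCategory","cate2Id":1,"isDisplay":1}'

for match in re.findall(r'\{"cate2Name":.*?"isDisplay":[01]\}', page_text):
    cate2_id = json.loads(match).get('cate2Id')
    print('https://www.douyu.com/gapi/rkc/directory/2_' + str(cate2_id))
```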

Part of the spider code is shown below:


```python
import re
import json

from scrapy import Spider, Request
from bs4 import BeautifulSoup


class DouyuSpider(Spider):
    name = "douyuspider"
    allowed_domains = ["douyu.com"]
    # An alternative is to hard-code the per-page API URLs for one category,
    # e.g. 'https://www.douyu.com/gapi/rkc/directory/2_1/0' through
    # '.../2_1/29', or '.../2_270/1' through '.../2_270/25'. Starting from
    # the directory page instead lets the spider discover every category.
    start_urls = ['https://www.douyu.com/directory']

    def parse(self, response):
        if response.status == 200:
            # Pull each category's cate2Id out of the JSON objects embedded
            # in the page's JavaScript, then splice the category API URLs.
            soup = BeautifulSoup(response.body, "lxml")
            fragments = re.findall(r'\{"cate2Name":.*?"isDisplay":[01]\}', soup.text)
            cate_urls = []
            for fragment in fragments:
                cate2_id = json.loads(fragment).get('cate2Id')
                cate_urls.append('https://www.douyu.com/gapi/rkc/directory/2_' + str(cate2_id))
            # Request every page of every category; the page count per
            # category is unknown, so take up to 50 pages each.
            for url in cate_urls:
                for i in range(0, 50):
                    yield Request(url + '/' + str(i), callback=self.parsePage, dont_filter=True)
```
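The original post does not show the parsePage callback. Below is a minimal sketch of it, assuming the gapi endpoint returns JSON shaped like `{"data": {"rl": [...]}}` with per-room fields rid (room number), rn (room title), nn (streamer name), and url (relative room path); these field names are assumptions about Douyu's response at the time of writing and should be verified against a live response. The method goes on the spider class above:

```python
    def parsePage(self, response):
        # Assumed layout: {"data": {"rl": [ {...room...}, ... ]}}; the field
        # names rid, rn, nn, and url are assumptions, not confirmed API docs.
        data = json.loads(response.text).get('data') or {}
        for room in data.get('rl', []):
            yield {
                'anchor_name': room.get('nn'),    # streamer name
                'room_number': room.get('rid'),   # room number
                'room_title': room.get('rn'),     # room (theater) title
                'room_url': 'https://www.douyu.com' + str(room.get('url') or ''),
            }
```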

 

Scraping Douyu room titles and viewer counts in Python usually relies on standard web-scraping techniques: fetch the page with the requests library, then parse the HTML with a helper such as BeautifulSoup or Scrapy. A brief outline:

1. **Install the required libraries**: make sure `requests`, `beautifulsoup4`, and `lxml` are installed (lxml matters if your BeautifulSoup version is older). Install them from the command line:
```
pip install requests beautifulsoup4 lxml
```

2. **Send the HTTP request**: use requests to send a GET request to the Douyu room-listing page and fetch the HTML source.
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.douyu.com/directory/all'  # example URL; replace with the actual room-list page
response = requests.get(url)
```

3. **Parse the HTML**: use BeautifulSoup to parse the returned HTML and locate the part containing the room information. This usually means finding specific HTML tags, e.g. `<div>` elements whose `class` or `id` corresponds to the room title and viewer count.
```python
soup = BeautifulSoup(response.text, 'lxml')
room_data = soup.find_all('div', class_='room-item')  # locate elements per the actual HTML structure
```

4. **Extract the information**: for each room element, pull out the title (often the text inside an `<a>` tag) and the viewer count (which may be rendered by JavaScript, requiring DOM analysis or a tool like Selenium to get the dynamic value).
```python
titles = [element.find('a').text for element in room_data]
popularity_numbers = []  # needs further handling: Douyu viewer counts change in real time
```

5. **Handle dynamic data**: if the viewer count is rendered by JavaScript, you may need a library such as Selenium to drive a real browser and read the live value; a sketch follows below.

Note that crawlers must respect the site's robots.txt, copyright, and user privacy; frequent heavy crawling can get your IP banned. In practice, prefer an official API if the site provides one; otherwise make sure the scraping stays lawful and compliant.
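A minimal Selenium sketch of step 5, assuming Chrome with Selenium 4; the CSS selectors reuse the `room-item` class from step 3 and are placeholders that must be adapted to Douyu's actual markup:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Drive a real browser so JavaScript-rendered content is present in the DOM.
driver = webdriver.Chrome()
try:
    driver.get('https://www.douyu.com/directory/all')
    # Placeholder selectors: inspect the live page and adjust accordingly.
    for card in driver.find_elements(By.CSS_SELECTOR, 'div.room-item'):
        title = card.find_element(By.CSS_SELECTOR, 'a').text
        print(title)
finally:
    driver.quit()
```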