Scraping Huya Live Information with the Scrapy Framework

TamoR.

Category: Python web scraping

This is an original article by the author and may not be reproduced without the author's permission. https://blog.youkuaiyun.com/weixin_43576564

Article link: https://blog.youkuaiyun.com/weixin_43576564/article/details/103555093

Main spider, hy.py:

# -*- coding: utf-8 -*-
import scrapy
from ..items import sortItem, gameInfo, gameSonSort, houseInfo
from scrapy import Request
import re
from time import sleep


class HySpider(scrapy.Spider):
    name = 'hy'
    allowed_domains = ['huya.com']
    start_urls = ['http://huya.com/g']

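    def parse(self, response):
        # Category links and their display names from the filter bar.
        urls = response.xpath("//div[@class='filter']/dl[1]/dd[position()=5]/a/@href").extract()
        names = response.xpath("//div[@class='filter']/dl[1]/dd[position()=5]/a/span/text()").extract()
        for url, name in zip(urls, names):
            # Build a fresh item per category so yielded items do not share state.
            Sort = sortItem()
            Sort['Surl'] = url
            Sort['Sname'] = name
            yield Sort
            # Follow the category link and parse its game list.
            yield response.follow(url, self.parseSort)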

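    def parseSort(self, response):
        game = gameInfo()
        # Parallel lists, one entry per game in the category's game list.
        gameName = response.xpath('//ul[@id="js-game-list"]/li/@title').extract()
        gameUrl = response.xpath('//ul[@id="js-game-list"]/li/a/@href').extract()
        gameImg = response.xpath('//ul[@id="js-game-list"]/li/a/img/@src').extract()
        # The report attribute holds a JSON-like string that contains the game_id.
        gameGid = response.xpath('//ul[@id="js-game-list"]/li/a/@report').extract()
        # Pattern capturing the game_id value (note: this name shadows the built-in str).
        str = re.compile('"game_id":"(.*)"}')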
        for i in range(len(gameImg)-1):
            game['gam