Scrapy is a Python crawler framework. Getting the environment set up on Ubuntu took me two whole days >_< since 14.04 LTS ships with Python 2.7, and switching over to 3.4, installing pip3, and dealing with OpenSSL all had their share of pitfalls. Fortunately I finally found the answers on Baidu and got everything installed.
Once the excitement wore off, it was time to get to work. Following the Scrapy 1.2 documentation, I worked through the first two chapters and wrote a simple spider. What did it crawl? panda.tv again, of course, 23333~
On to the code ->
At the command line (in your projects directory):
scrapy startproject panda
cd panda
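If everything goes smoothly, `startproject` generates a project skeleton roughly like this (layout as of Scrapy 1.2; the item definition below belongs in `items.py`, and the spider in a new file under `spiders/`):

```
panda/
    scrapy.cfg            # deploy/config file
    panda/
        __init__.py
        items.py          # item (data type) definitions go here
        pipelines.py
        settings.py
        spiders/
            __init__.py   # spider modules go in this package
```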
# items.py -- define the data type for one live-stream entry
import scrapy

class Video(scrapy.Item):
    title = scrapy.Field()
    name = scrapy.Field()
    population = scrapy.Field()
    category = scrapy.Field()

# spiders/panda_spider.py
import scrapy
from panda.items import Video

class PandaSpider(scrapy.Spider):
    name = 'panda'  # spider name, must be unique
    allowed_domains = ['panda.tv']
    start_urls = ['http://www.panda.tv/all']

    # start_urls is shorthand for this:
    # def start_requests(self):
    #     yield scrapy.Request('http://www.panda.tv/all', self.parse)

    def parse(self, response):
        video = Video()
        items = response.xpath('//a[@class="video-list-item-wrap"]')
        for info in items:
            subinfo = info.xpath('.//div[@class="video-info"]')
            # .extract() returns a list of every match;
            # .extract_first() would give just the first string
            video['title'] = info.xpath('.//div[@class="video-title"]/text()').extract()
            video['name'] = subinfo.xpath('.//span[@class="video-nickname"]/text()').extract()
            video['population'] = subinfo.xpath('.//span[@class="video-number"]/text()').extract()
            video['category'] = subinfo.xpath('.//span[@class="video-cate"]/text()').extract()
            # Video is dict-like, so this could simply be `yield video`
            yield {
                'title': video['title'],
                'name': video['name'],
                'population': video['population'],
                'category': video['category']
            }

        # Pagination, still to be finished
        # (note: extract_first() is needed here -- extract() returns a list,
        # which is never None and can't be passed to urljoin):
        # next_page = response.css('a.j-page-next::attr(href)').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, self.parse)
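The pagination sketch relies on `response.urljoin`, which resolves the (possibly relative) `href` of the next-page link against the current page's URL. This is ordinary RFC 3986 URL joining, the same behavior as the stdlib's `urllib.parse.urljoin`; a quick illustration (the paths here are made-up examples, not real panda.tv links):

```python
from urllib.parse import urljoin

# A root-relative href is resolved against the page it appeared on
print(urljoin('http://www.panda.tv/all', '/all?pageno=2'))
# -> http://www.panda.tv/all?pageno=2

# An absolute href passes through unchanged
print(urljoin('http://www.panda.tv/all', 'http://www.panda.tv/cate'))
# -> http://www.panda.tv/cate
```

Note that `urljoin` only accepts a string, which is why the next-page `href` should be pulled out with `extract_first()` rather than `extract()` (the latter returns a list).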
At the command line (the spider's name):
scrapy crawl panda
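Each dict the spider yields is collected by Scrapy as one item. One thing that surprised me at first: because `.extract()` returns every match as a list, each field comes out wrapped in a list. A sketch of what a single scraped item looks like when serialized to JSON (the values here are invented placeholders, not real scrape results):

```python
import json

# One yielded item, shaped the way parse() builds it:
# every field is a list because .extract() returns all matches.
item = {
    'title': ['Some stream title'],
    'name': ['some_streamer'],
    'population': ['1.2万'],
    'category': ['LOL'],
}
print(json.dumps(item, ensure_ascii=False))
```

Switching the spider to `.extract_first()` would make each field a plain string instead.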
The output:
Compared with the Node.js crawler I wrote before, this feels a lot more convenient~ First step accomplished; more learning to come!