2021/6/10 Web Scraping, Lesson 22 (CrawlSpider, logging in with Scrapy)

License: CC 4.0 BY-SA

Tags:

#crawlspider #scrapy

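This article covers how to use CrawlSpider in the Scrapy framework and what it is good for, then explains how to log in with Scrapy in two ways: by carrying cookies and by sending a POST request.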

Contents

1. CrawlSpider
2. Logging in with Scrapy

1. CrawlSpider

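Introduction:
In the code written so far, much of the effort went into locating the URL of the next page or of each detail page. Can that step be made simpler?

Definition:
CrawlSpider is another way of crawling data in Scrapy.

Learning goals:
Understand how to use CrawlSpider.
CrawlSpider is a subclass of the Spider class.

Its key feature:
It extracts links from each response according to the rules you define and hands the resulting requests to the engine.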
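
To make "extract links according to rules" concrete, here is a minimal standard-library sketch of the idea. It is not Scrapy's implementation: `AnchorCollector`, `extract_links`, and the sample HTML are hypothetical names and data made up for illustration. The sketch collects every `<a href>`, resolves it against the page URL, and keeps only the URLs that match an `allow` regular expression.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, allow):
    """Return absolute, de-duplicated links whose URL matches `allow`."""
    collector = AnchorCollector()
    collector.feed(html)
    pattern = re.compile(allow)
    links = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if pattern.search(url) and url not in links:
            links.append(url)
    return links


# Hypothetical page fragment, made up for illustration.
sample_html = """
<a href="/default_2.aspx">next page</a>
<a href="https://so.gushiwen.cn/shiwenv_abc123.aspx">poem</a>
<a href="https://so.gushiwen.cn/shiwenv_abc123.aspx">poem again</a>
<a href="/about.aspx">about</a>
"""
page_url = "https://www.gushiwen.cn/default_1.aspx"

# One "rule" for detail pages and one for pagination links.
detail_links = extract_links(
    sample_html, page_url, r"https://so\.gushiwen\.cn/shiwenv_\w+\.aspx"
)
page_links = extract_links(
    sample_html, page_url, r"https://www\.gushiwen\.cn/default_[12]\.aspx"
)
print(detail_links)  # ['https://so.gushiwen.cn/shiwenv_abc123.aspx']
print(page_links)    # ['https://www.gushiwen.cn/default_2.aspx']
```

In a real CrawlSpider this work is done by `LinkExtractor` inside each `Rule`, so the spider itself only has to state the patterns and the callback.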

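How to create a CrawlSpider:
scrapy genspider -t crawl xx xx.com
CrawlSpider is convenient in some scenarios. The precondition is that the URLs follow a pattern that is easy to express as a regular expression.
The regular expressions must be written correctly.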
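
A quick way to check a pattern before putting it into a rule is to run it against a few URLs with Python's `re` module. The sketch below compares two hypothetical patterns for pagination URLs: a loose one with a comma inside the character class and unescaped dots, and a stricter one. The candidate URLs are made up for illustration, and `search` is used on the assumption that the pattern may match anywhere in the URL.

```python
import re

# Loose pattern: "[1,2]" also accepts a literal comma, and "." matches any character.
loose = re.compile(r"https://www.gushiwen.cn/default_[1,2].aspx")
# Stricter pattern: dots escaped, character class limited to the digits 1 and 2.
strict = re.compile(r"https://www\.gushiwen\.cn/default_[12]\.aspx")

candidates = [
    "https://www.gushiwen.cn/default_1.aspx",  # intended match
    "https://www.gushiwen.cn/default_2.aspx",  # intended match
    "https://www.gushiwen.cn/default_,.aspx",  # comma accepted by [1,2]
    "https://www.gushiwen.cn/default_1xaspx",  # "x" accepted by unescaped "."
    "https://www.gushiwen.cn/default_3.aspx",  # outside the intended range
]

# Map each URL to (matched by loose, matched by strict).
results = {
    url: (bool(loose.search(url)), bool(strict.search(url)))
    for url in candidates
}

for url, (loose_hit, strict_hit) in results.items():
    print(f"{url}  loose={loose_hit}  strict={strict_hit}")
```

The third and fourth URLs are accepted only by the loose pattern, which is the kind of silent over-matching that makes a crawler follow pages it was never meant to visit.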

Case study:
Requirements: 1) open the home page; 2) follow the links to the detail pages and get each poem's title.
Code: (D:\python_spider\day22\ancient_poems\ancient_poems\spiders\poems.py)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PoemsSpider(CrawlSpider):
    name = 'poems'
    allowed_domains = ['gushiwen.cn','gushiwen.org']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    rules = (
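        # Pagination pages: follow them, no callback needed.
        # Dots are escaped and the character class is [12] (a comma inside [] would match a literal comma).
        Rule(LinkExtractor(allow=r'https://www\.gushiwen\.cn/default_[12]\.aspx'), follow=True),
        # Detail pages: parse each one with parse_item.
        Rule(LinkExtractor(allow=r'https://so\.gushiwen\.cn/shiwenv_\w+\.aspx'), callback='parse_item', follow=True),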
    )

    def parse_item(self, response):
        # Each content block on the detail page is a div.sons inside div.left.
        gsw_divs = response.xpath('//div[@class="left"]/div[@class="sons"]')

        for gsw_div in gsw_divs:
            # Only blocks that contain an <h1> carry a poem title.
            title = gsw_div.xpath('.//h1/text()').get()
            if title:
                print(title)
                # Yield the title instead of returning an empty dict,
                # so the item actually reaches the pipelines.
                yield {'title': title.strip()}