Scrapy: pipeline, item, shell

Licensed under CC 4.0 BY-SA

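Introduction to pipelines

The ITEM_PIPELINES setting is a dict, which already suggests that a project can have more than one pipeline, and indeed several can be defined.

Why use several pipelines:
1. A project may have several spiders, and each spider's items may need different handling.
2. The items from one spider may need several different operations, such as being stored in different databases.

Notes:
1. The smaller a pipeline's weight, the higher its priority (it runs earlier).
2. The process_item method in a pipeline must not be renamed.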
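
As a minimal sketch of the points above, the following shows two pipelines and the dict that registers them. The class names, the "source" key, and the "myproject" module path are placeholders made up for illustration; only the process_item(item, spider) signature and the ITEM_PIPELINES weights follow Scrapy's conventions.

```python
class StripTitlePipeline:
    """Hypothetical pipeline that normalizes the title of every item."""

    def process_item(self, item, spider):
        if item.get("title"):
            item["title"] = item["title"].strip()
        return item  # returning the item passes it on to the next pipeline


class TagSourcePipeline:
    """Hypothetical pipeline that only acts on items from one spider."""

    def process_item(self, item, spider):
        if spider.name == "hr":
            item["source"] = "tencent"
        return item


# In settings.py the values are weights; lower numbers run first.
# "myproject" is a placeholder project name.
ITEM_PIPELINES = {
    "myproject.pipelines.TagSourcePipeline": 301,
    "myproject.pipelines.StripTitlePipeline": 300,
}

# The order in which the pipelines above would be called.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

Checking spider.name inside process_item is one simple way to let a single pipeline serve only some spiders.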

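Using the logging module

In Scrapy
- Set LOG_LEVEL = "WARNING" in settings
- Set LOG_FILE = "./a.log" in settings  # where the log is saved; once set, log output no longer appears in the terminal
- import logging and instantiate a logger; the logger can then be used for output in any file
In an ordinary project
- import logging
- logging.basicConfig(…)  # sets the style and format of the log output
- Instantiate a logger: logger = logging.getLogger(__name__)
- Call the logger from any .py file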
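
The steps for an ordinary project can be sketched as follows. The level and format string passed to basicConfig are assumptions chosen for illustration, and describe is a hypothetical helper; the original does not specify either.

```python
import logging

# Assumed configuration: only WARNING and above, with a timestamped format.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s [%(name)s] %(levelname)s: %(message)s",
)

# One logger per module, named after the module it lives in.
logger = logging.getLogger(__name__)


def describe(item):
    """Hypothetical helper that logs an item and returns it unchanged."""
    logger.warning("got item: %s", item)
    return item
```

Naming the logger with __name__ makes each log line show which module produced it.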

Implementing pagination requests

# -*- coding: utf-8 -*-
import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://careers.tencent.com/search.html']

    def parse(self, response):
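        row_list = response.xpath("//div[@class='recruit-list']/a")
        for row in row_list:  # avoid shadowing the standard-library name `re`
            item = dict()
            item["title"] = row.xpath("./h4/text()").extract_first()
            yield item

        # Find the URL of the next page. Assumption: the link is an <a> inside
        # <li class="next">, and the class changes to "next disabled" on the
        # last page, so the exact-match selector then returns None.
        next_url = response.xpath("//li[@class='next']/a/@href").extract_first()
        if next_url is not None:
            yield scrapy.Request(
                response.urljoin(next_url),
                # pass the method itself; self.parse() would call it immediately
                callback=self.parse
            )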

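scrapy.Request key points:
scrapy.Request(url[, callback, method='GET', headers, body, cookies, meta, dont_filter=False])
Note: in documentation, parameters inside square brackets are generally optional.

Common scrapy.Request parameters:
callback: specifies which parse function handles the response for the given URL
meta: passes data between parse functions; by default meta also carries some information such as download latency and request depth
dont_filter: exempts the current URL from Scrapy's duplicate filter. Scrapy deduplicates URLs by default, so this matters for URLs that must be requested repeatedly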

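Introduction to items and how to use them

When collecting data, use different item classes to hold different kinds of data.
When the data is handed to a pipeline, isinstance(item, MyItem) (where MyItem is one of your item classes, not a spider) tells you which item class the data belongs to, so each kind of item can be processed differently.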
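
The isinstance check can be sketched as follows. To keep the example runnable without a project, it uses dataclasses, which Scrapy also accepts as items; JobItem, ComplaintItem, and RoutingPipeline are hypothetical names, and the same check works for scrapy.Item subclasses.

```python
from dataclasses import dataclass


@dataclass
class JobItem:
    """Hypothetical item for job postings."""
    title: str


@dataclass
class ComplaintItem:
    """Hypothetical item for complaint posts."""
    title: str
    content: str = ""


class RoutingPipeline:
    """Applies different processing depending on the item class."""

    def process_item(self, item, spider):
        if isinstance(item, JobItem):
            item.title = item.title.strip()
        elif isinstance(item, ComplaintItem):
            item.content = item.content.strip()
        return item
```

Items of any other class pass through the pipeline unchanged.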

Spider for the Yangguang (Sunshine) government affairs platform

# -*- coding: utf-8 -*-
import scrapy
from yangguang.items import YangguangItem
import json


class YgSpider(scrapy.Spider):
    name = 'yg'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://d.wz.sun0769.com/index.php/question/huiyin']

    def parse(self, response):
        # Group the page into table rows
        tr_list = response.xpath("//div[@class='newsHead clearfix']/table[2]/tr")
        for tr in tr_list:
            item = YangguangItem()
            item["title"] = tr.xpath("./td[3]/a[@class='news14']/text()").extract_first()
            item["href"] = tr.xpath("./td[3]/a[@class='news14']/@href").extract_first()
            item["publish"] = tr.xpath("./td[6]/text()").extract_first()

            yield scrapy.Request(
                item["href"],
                callback=self.parse_detail,
                meta={"item": item}
            )
        # Pagination
        next_url = response.xpath("//a[text()='>']/@href").extract_first()
        if next_url is not None:
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )

    def parse_detail(self, response):  # handle the detail page
        item = response.meta["item"]
        item["content"] = response.xpath("//div[@class='txt16_3']//text()").extract()
        item["content_img"] = response.xpath("//div[@class='txt16_3']//img/@src").extract()
        item["content_img"] = ["http://wz.sun0769.com" + i for i in item["content_img"]]
        yield item  # hand the finished item to the pipelines instead of printing it

Item definition

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class YangguangItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    href = scrapy.Field()
    publish = scrapy.Field()
    content_img = scrapy.Field()
    content = scrapy.Field()

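Using scrapy shell

The Scrapy shell is an interactive shell where you can try out and debug your scraping code without running a spider. It is meant for testing data-extraction code, but you can also use it as a regular Python shell and test any Python code in it.

The shell is used to test XPath or CSS expressions, to see how they work and what data they extract from the pages you are scraping. While you write a spider, it lets you test expressions interactively, so you do not have to rerun the spider after every change.

Once you are familiar with the Scrapy shell, you will find it invaluable for developing and debugging spiders.

If IPython is installed, the Scrapy shell uses it instead of the standard Python console. The IPython console is more powerful and offers smart auto-completion, colorized output, and other features.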

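Usage:
scrapy shell <url>: start the shell
response.request.url: the URL of the request that produced the current response
response.headers: the response headers
response.body: the response body, i.e. the HTML, as bytes by default
response.request.headers: the request headers for the current response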