2021SC@SDUSC
part3: Passing additional data to callback functions
The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.
For example:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
In some cases you may want to pass arguments to these callback functions, so that you can receive them later in the second callback. The following example shows how to achieve this using the Request.cb_kwargs attribute:
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
part4: Using errbacks to catch exceptions in request processing
The errback of a request is a function that will be called when an exception is raised while processing it.
It receives a Failure as its first parameter and can be used to track connection establishment timeouts, DNS errors, etc.
Here is an example that logs all errors and catches some specific ones when needed:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
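The branching in errback_httpbin relies on Twisted's Failure.check, which returns the first exception class that the wrapped exception matches, or None otherwise. A minimal plain-Python sketch of that dispatch logic, using a simplified Failure stand-in (the class and the error types below are stand-ins, not the real Twisted objects):

```python
# Simplified stand-in for twisted.python.failure.Failure, just enough to
# show how failure.check() drives the if/elif chain in the errback above.
class Failure:
    def __init__(self, exc):
        self.value = exc
        self.type = type(exc)

    def check(self, *exc_classes):
        # Like Twisted's Failure.check: return the first class the wrapped
        # exception is an instance of, or None if none match.
        for cls in exc_classes:
            if issubclass(self.type, cls):
                return cls
        return None

# Stand-ins for the Twisted/Scrapy exception classes used in the spider.
class DNSLookupError(Exception): pass
class TimeoutError(Exception): pass
class TCPTimedOutError(TimeoutError): pass

def classify(failure):
    if failure.check(DNSLookupError):
        return 'dns'
    elif failure.check(TimeoutError, TCPTimedOutError):
        return 'timeout'
    return 'other'

print(classify(Failure(DNSLookupError())))    # dns
print(classify(Failure(TCPTimedOutError())))  # timeout
```

Because check accepts several classes at once, a single elif branch can handle a whole family of related errors, as the TimeoutError/TCPTimedOutError branch does.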
This post has covered how Scrapy uses callback functions to handle successful responses, and how errbacks catch and handle the exceptions that can arise while processing a request, including HTTP errors, DNS errors, and connection timeouts. The examples demonstrated how to pass additional arguments to callbacks and how to tailor error handling to different failure types.