2021SC@SDUSC
part3: Passing additional data to callback functions
The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.
For example:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
In some cases you may want to pass arguments to these callback functions, so that you can receive them later in the second callback. The following example shows how to achieve this using the Request.cb_kwargs attribute:
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
part4: Using errbacks to catch exceptions in request processing
The errback of a request is a function that will be called when an exception is raised while processing it.
It receives a Failure as its first parameter and can be used to track connection establishment timeouts, DNS errors, etc.
Here is an example that logs all errors and catches some specific ones when needed:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
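The branching in errback_httpbin relies on Twisted's Failure.check, which returns the first exception class that the wrapped exception matches, or None otherwise. A minimal plain-Python sketch of that dispatch logic, using a simplified Failure stand-in (the class and the error types below are stand-ins, not the real Twisted objects):

```python
# Simplified stand-in for twisted.python.failure.Failure, just enough to
# show how failure.check() drives the if/elif chain in the errback above.
class Failure:
    def __init__(self, exc):
        self.value = exc
        self.type = type(exc)

    def check(self, *exc_classes):
        # Like Twisted's Failure.check: return the first class the wrapped
        # exception is an instance of, or None if none match.
        for cls in exc_classes:
            if issubclass(self.type, cls):
                return cls
        return None

# Stand-ins for the Twisted/Scrapy exception classes used in the spider.
class DNSLookupError(Exception): pass
class TimeoutError(Exception): pass
class TCPTimedOutError(TimeoutError): pass

def classify(failure):
    if failure.check(DNSLookupError):
        return 'dns'
    elif failure.check(TimeoutError, TCPTimedOutError):
        return 'timeout'
    return 'other'

print(classify(Failure(DNSLookupError())))    # dns
print(classify(Failure(TCPTimedOutError())))  # timeout
```

Because check accepts several classes at once, a single elif branch can handle a whole family of related errors, as the TimeoutError/TCPTimedOutError branch does.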
This post has covered how Scrapy uses callback functions to handle successful responses, and how errbacks catch and handle the exceptions that can arise while processing a request, including HTTP errors, DNS errors, and connection timeouts. The examples demonstrated how to pass additional arguments to callbacks and how to tailor error handling to different failure types.