Spider repo: https://github.com/Karmenzind/EasyGoSpider
The requirements here:

- When the returned JSON carries a code other than 0 (a normal response is {"code": 0, ...}), reschedule the request into the retry queue (sample payloads are sketched below).
- If the JSON indicates the cookie has been banned, repair the cookie list accordingly.
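For concreteness, the payload shapes being handled look roughly like this; the exact fields are assumptions inferred from the checks in LocalRetryMiddleware further down:

# Assumed payload shapes (inferred from the middleware code below):
ok_payload  = {"code": 0, "data": "..."}                  # success: pass through to the spider
err_payload = {"code": 1, "data": "..."}                  # code != 0: reschedule for retry
ban_payload = {"code": 1, "data": u"该用户访问次数过多"}    # cookie banned: drop the cookie, then retry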
A comment in Scrapy's RetryMiddleware source reads:
Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages.
And per the Scrapy docs' description of process_response in generic downloader middlewares:
If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().
When a Request object is returned, the response will never reach spider.parse_item, so no handling is needed there.
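For context, the inherited _retry helper behaves roughly as follows (a paraphrase of Scrapy's source; details differ between versions). This also explains the self._retry(request, reason, spider) or response idiom used below: _retry returns a copied Request while attempts remain, and None once they are exhausted, at which point the original response falls through.

# Paraphrased sketch of RetryMiddleware._retry (not verbatim; varies by Scrapy version):
def _retry(self, request, reason, spider):
    retries = request.meta.get('retry_times', 0) + 1
    if retries <= self.max_retry_times:
        retryreq = request.copy()                # fresh copy of the failed request
        retryreq.meta['retry_times'] = retries   # bump the retry counter
        retryreq.dont_filter = True              # bypass the dupefilter on reschedule
        retryreq.priority = request.priority + self.priority_adjust
        return retryreq                          # handed back to the scheduler
    # falls through (returns None) once retries are exhausted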
The final modification:
import datetime
import json

# Pre-1.0 Scrapy import paths, matching the settings below;
# newer Scrapy uses scrapy.downloadermiddlewares.retry.
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

from EasyGoSpider.db import mongo_cli  # project MongoDB client; import path assumed


class LocalRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        # Custom handling starts here.
        resp_dct = json.loads(response.body)
        if resp_dct.get('code') != 0:
            reason = "Code is not 0."
            # u"该用户访问次数过多": "this user has made too many requests"
            if resp_dct.get("data") == u"\u8be5\u7528\u6237\u8bbf\u95ee\u6b21\u6570\u8fc7\u591a":
                banned_cookie = request.cookies
                reason = "%s has been BANNED today." % banned_cookie
                spider.logger.warning(reason)
                # Drop the banned cookie from the in-memory pool and mark it in MongoDB.
                spider.cookies.remove(banned_cookie)
                mongo_cli.cookies.find_one_and_update(
                    {"cookie": banned_cookie},
                    {"$set": {"FailedDate": str(datetime.date.today())}})
            return self._retry(request, reason, spider) or response
        return response
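Removing the banned cookie from spider.cookies only pays off if outgoing requests (including retried copies, which inherit the cookies of the original) pick their cookie from the pool at download time. A hypothetical companion hook along these lines would do it (the real EasyGoSpider logic may differ):

# Hypothetical sketch: assign a random live cookie to every outgoing request,
# so retried copies never reuse a cookie that was just removed from the pool.
import random

class RandomCookieMiddleware(object):
    def process_request(self, request, spider):
        if spider.cookies:
            request.cookies = random.choice(spider.cookies)
        # returning None lets Scrapy continue processing the request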
Activate it in settings.py:
DOWNLOADER_MIDDLEWARES = {
    # Redirects are disabled so redirect responses fall through to the retry logic.
    "scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware": None,
    # Disable the stock RetryMiddleware so HTTP-error retries are not counted twice.
    "scrapy.contrib.downloadermiddleware.retry.RetryMiddleware": None,
    "EasyGoSpider.middleware.LocalRetryMiddleware": 302,
}
In addition, the retry count (RETRY_TIMES) and the filter rules also need changes; those were configured earlier, so they are omitted here.
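For completeness, the retry-related settings referred to above would look roughly like this in settings.py (placeholder values, not the project's actual configuration):

RETRY_ENABLED = True
RETRY_TIMES = 5                                     # placeholder: extra attempts per request
RETRY_HTTP_CODES = [302, 500, 502, 503, 504, 408]   # placeholder; 302 assumed, since redirects are disabled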