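Scrapy has two main kinds of middleware: Spider Middleware and Downloader Middleware. You can define any number of custom middlewares of either kind, and each one is written as a class.
Downloader Middleware [DM]
Enabling a custom DM
To enable a custom DM, first register it in settings.py. For example, after defining a UserAgentMiddleware component, add the following to settings.py: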
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.UserAgentMiddleware': 543,
}
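When the project starts, Scrapy automatically merges the custom UserAgentMiddleware into its list of default middlewares (the DOWNLOADER_MIDDLEWARES_BASE setting). The defaults are: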
{
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
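DM ordering rules
Middlewares are ordered by the numeric value assigned to each one. For downloader middlewares, the smaller the value, the earlier the middleware sits in the chain, that is, the closer it is to the Scrapy engine; the larger the value, the later it sits and the closer it is to the downloader.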
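To make the ordering concrete, the sketch below sorts a few middlewares by value and prints the order in which their hooks would run. It is plain Python with no Scrapy import, and `chain_order` is a hypothetical helper, not a Scrapy API. It relies on the documented behavior that process_request hooks run in increasing order of value, while process_response hooks run in decreasing order.

```python
# Hypothetical helper for illustration only; Scrapy does this internally.
def chain_order(middlewares):
    """Return middleware paths sorted from the engine side to the downloader side."""
    return [path for path, _ in sorted(middlewares.items(), key=lambda kv: kv[1])]


middlewares = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "myproject.middlewares.UserAgentMiddleware": 543,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}

request_order = chain_order(middlewares)         # engine -> downloader
response_order = list(reversed(request_order))   # downloader -> engine

for path in request_order:
    print("process_request: ", path)
for path in response_order:
    print("process_response:", path)
```

With a value of 543, the custom middleware sees each request just before RetryMiddleware (550) and sees each response just after it.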
Disabling a built-in DM
To disable one of the default downloader middlewares, set its value to None. For example, to turn off the built-in user-agent middleware:
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.UserAgentMiddleware': 543,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
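How the custom entry and the None entry combine with the defaults can be sketched in plain Python. This is an approximation of the merge step, not Scrapy's actual code: `BASE` holds only three of the default entries, and `effective_middlewares` is a hypothetical helper.

```python
# A shortened stand-in for the built-in defaults (three entries only).
BASE = {
    "scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": 400,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
}

# The project setting from settings.py.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.UserAgentMiddleware": 543,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}


def effective_middlewares(base, custom):
    """Merge custom entries over the defaults and drop those set to None."""
    merged = {**base, **custom}
    enabled = {path: order for path, order in merged.items() if order is not None}
    return sorted(enabled, key=enabled.get)


active = effective_middlewares(BASE, DOWNLOADER_MIDDLEWARES)
for path in active:
    print(path)
```

The built-in user-agent middleware disappears from the result, and the custom one takes its place between the 400 and 550 entries.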
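Writing a custom DM
Each middleware is a Python class that must define one or more of the following methods:
process_request(request, spider)
process_response(request, response, spider)
process_exception(request, exception, spider)
from_crawler(cls, crawler)
The parameters are:
request (Request object),
spider (Spider object),
response (Response object),
exception (an Exception object),
crawler (Crawler object)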
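A minimal sketch of such a class follows, assuming the conventional hook signatures process_request(request, spider), process_response(request, response, spider) and from_crawler(cls, crawler). The fallback user-agent string and the stand-in objects at the bottom are assumptions made for this example; they let the sketch run without Scrapy installed and are not part of a real project.

```python
from types import SimpleNamespace


class UserAgentMiddleware:
    """Sets a User-Agent header on requests that do not already carry one."""

    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this, if present, to build the middleware instance.
        # The fallback string is an assumption for this example.
        return cls(crawler.settings.get("USER_AGENT", "my-crawler/0.1"))

    def process_request(self, request, spider):
        # Returning None lets the request continue to the next middleware.
        request.headers.setdefault("User-Agent", self.user_agent)
        return None

    def process_response(self, request, response, spider):
        # Must return a Response (or a Request); here it passes through unchanged.
        return response


# Quick check with stand-in objects, so the sketch runs without Scrapy.
crawler = SimpleNamespace(settings={"USER_AGENT": "demo-agent/1.0"})
request = SimpleNamespace(headers={})
mw = UserAgentMiddleware.from_crawler(crawler)
result = mw.process_request(request, spider=None)
print(request.headers["User-Agent"])
```

In a real project only the class would live in myproject/middlewares.py; Scrapy supplies the crawler, request and spider objects itself.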
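Built-in DMs
The built-in DMs are the ones in the default list shown earlier.
Spider Middleware [SM]
Enabling an SM
This works the same way as for DMs: register the class in SPIDER_MIDDLEWARES in settings.py, and set a built-in middleware to None to disable it.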
SPIDER_MIDDLEWARES = {
'myproject.middlewares.CustomSpiderMiddleware': 543,
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
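What a class like CustomSpiderMiddleware might contain can be sketched as follows, assuming the conventional synchronous hooks process_spider_input(response, spider) and process_spider_output(response, result, spider). The filtering rule (drop dict items with no 'title') is invented for illustration, and asynchronous callbacks are not covered here.

```python
class CustomSpiderMiddleware:
    """Drops dict items that have no 'title' value; passes everything else on."""

    def process_spider_input(self, response, spider):
        # Called for each response before the spider callback sees it.
        # Returning None lets processing continue.
        return None

    def process_spider_output(self, response, result, spider):
        # `result` is an iterable of whatever the spider callback yielded.
        for obj in result:
            if isinstance(obj, dict) and not obj.get("title"):
                continue  # filtering rule assumed for this example
            yield obj


# Quick check with plain Python values standing in for spider output.
mw = CustomSpiderMiddleware()
spider_output = [{"title": "A"}, {"title": ""}, {"url": "x"}, "not-a-dict"]
kept = list(mw.process_spider_output(response=None, result=spider_output, spider=None))
print(kept)
```

As with DMs, the class needs no particular base class; Scrapy looks only for the hook methods it defines.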
Built-in SMs
{
'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}