The role of the close method in scrapy.Spider

This article walks through the basic structure of a Scrapy spider: how the spider class is defined and initialized, how per-spider settings override project settings, and what happens when the spider is closed. These basics are essential groundwork for developing Scrapy spiders.


In Scrapy, every spider you implement must subclass scrapy.Spider. Here is an annotated walk-through of the relevant source:

class Spider(object_ref):
    # The spider's name; Scrapy uses it to locate (and instantiate) the spider.
    name = None
    # When the spider starts, these settings override the project-level settings.
    custom_settings = None

    def __init__(self, name=None, **kwargs):
        # A spider must have a name; otherwise abort with an error.
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        # A Python object stores its attributes in the built-in __dict__,
        # so any extra keyword arguments become spider attributes.
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []
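
For instance, a user-defined spider usually supplies name and start_urls as class attributes, which is why __init__ only falls back to an empty list. A minimal sketch (the spider name, URL, and selector below are placeholders, not from the Scrapy source):

import scrapy

class DemoSpider(scrapy.Spider):
    # 'demo' is the name Scrapy looks up when you run `scrapy crawl demo`
    name = 'demo'
    # overrides the empty list that Spider.__init__ would otherwise set
    start_urls = ['https://example.com']

    def parse(self, response):
        # minimal callback: yield the page title
        yield {'title': response.css('title::text').get()}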
# The class method Scrapy uses to create the spider.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = cls(*args, **kwargs)
    spider._set_crawler(crawler)
    return spider
Parameters:
  • crawler (Crawler instance) - the crawler the spider will be bound to
  • args (list) - positional arguments passed to the __init__() method
  • kwargs (dict) - keyword arguments passed to the __init__() method
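
Since from_crawler is where the crawler gets bound to the spider, overriding it is the idiomatic way to read project settings or connect signals during spider construction. A hedged sketch; the setting name MY_SETTING and the attribute it feeds are hypothetical:

import scrapy

class ConfiguredSpider(scrapy.Spider):
    name = 'configured'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # let the base implementation create the spider and bind the crawler
        spider = super().from_crawler(crawler, *args, **kwargs)
        # then read a project setting; MY_SETTING is a hypothetical name
        spider.my_setting = crawler.settings.get('MY_SETTING')
        return spider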
# A static method, called when the spider is closed.
@staticmethod
def close(spider, reason):
    closed = getattr(spider, 'closed', None)
    if callable(closed):
        return closed(reason)

As the source of close shows, if you need to perform some cleanup when the crawl ends, you can either override close itself or, more simply, implement a closed method in your spider class.
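
As a concrete illustration, here is a minimal sketch of a spider that implements closed to persist collected items on shutdown; the spider name, URL, selectors, and output file are placeholders, not from the original article:

import json
import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['https://example.com/news']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.items = []

    def parse(self, response):
        for title in response.css('h2::text').getall():
            self.items.append({'title': title})

    def closed(self, reason):
        # Spider.close() finds this method via getattr and calls it with the
        # shutdown reason, e.g. 'finished', 'cancelled' or 'shutdown'
        self.logger.info('spider closed (%s), saving %d items', reason, len(self.items))
        with open('results.json', 'w', encoding='utf-8') as f:
            json.dump(self.items, f, ensure_ascii=False)

Because Spider.close looks the method up with getattr, defining closed(self, reason) is all that is needed; no extra signal wiring is required.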

 

