There are many ways to invoke Scrapy from a script, for example:
import os
os.system("scrapy crawl SpiderName")
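A side note on this first variant: os.system spawns a shell and only hands back an exit status. A subprocess-based sketch (assuming the script is run from the project directory, i.e. next to scrapy.cfg, and the spider is named SpiderName) behaves the same way but is easier to inspect:

import subprocess

# The crawl runs in a child process; the parent script always regains
# control once the child exits, regardless of how the crawl went.
result = subprocess.run(["scrapy", "crawl", "SpiderName"])
print("scrapy exited with code", result.returncode)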
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
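Worth noting: a single CrawlerProcess can queue several spiders before start(), and they all run in the same process. A sketch following the "run multiple spiders in the same process" pattern from the Scrapy docs; Spider1 and Spider2 are placeholder classes:

import scrapy
from scrapy.crawler import CrawlerProcess

class Spider1(scrapy.Spider):
    name = 'spider1'
    ...

class Spider2(scrapy.Spider):
    name = 'spider2'
    ...

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # blocks until both spiders have finished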
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
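If you need several crawls to run one after another rather than concurrently, the Scrapy docs chain them with Twisted deferreds. A sketch, assuming MySpider1 and MySpider2 are spider classes defined like the one above:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before starting the next.
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # blocks until the last crawl finishes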
Only two of these, however, actually do what the title of this post describes. To summarize:
The first is simple, but cmdline.execute() itself ends with sys.exit(cmd.exitcode), so the program is guaranteed to exit as soon as the crawl finishes and nothing after the call ever runs. It is therefore only suitable for debugging.
from scrapy import cmdline
cmdline.execute("scrapy crawl SpiderName".split())
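To see the limitation concretely, a minimal sketch where the final print stands in for hypothetical post-processing:

from scrapy import cmdline

cmdline.execute("scrapy crawl SpiderName".split())
# Never reached: execute() ends with sys.exit(cmd.exitcode),
# so the interpreter terminates as soon as the crawl finishes.
print("post-processing...")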
The second takes a few more lines of code than the first, but it does not share that drawback, so it is the recommended approach.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# 'SpiderName' is the name of one of the spiders of the project;
# extra keyword arguments such as domain are passed to the spider.
process.crawl('SpiderName', domain='123.com')
process.start()  # the script will block here until the crawling is finished
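To confirm the difference from the first approach, a sketch in which the final print stands in for whatever follow-up work you need:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('SpiderName', domain='123.com')
process.start()

# Execution continues here after the crawl, which is exactly what
# cmdline.execute cannot offer. One caveat: start() cannot be called a
# second time, because the Twisted reactor is not restartable in one process.
print("crawl finished, running post-processing...")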