There are many ways to invoke Scrapy from a script, for example:
import os
os.system("scrapy crawl SpiderName")
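A side note on this first variant: os.system spawns a shell and only hands back an exit status. A subprocess-based sketch (assuming the script is run from the project directory, i.e. next to scrapy.cfg, and the spider is named SpiderName) behaves the same way but is easier to inspect:

import subprocess

# The crawl runs in a child process; the parent script always regains
# control once the child exits, regardless of how the crawl went.
result = subprocess.run(["scrapy", "crawl", "SpiderName"])
print("scrapy exited with code", result.returncode)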
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
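Worth noting: a single CrawlerProcess can queue several spiders before start(), and they all run in the same process. A sketch following the "run multiple spiders in the same process" pattern from the Scrapy docs; Spider1 and Spider2 are placeholder classes:

import scrapy
from scrapy.crawler import CrawlerProcess

class Spider1(scrapy.Spider):
    name = 'spider1'
    ...

class Spider2(scrapy.Spider):
    name = 'spider2'
    ...

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # blocks until both spiders have finished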
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
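If you need several crawls to run one after another rather than concurrently, the Scrapy docs chain them with Twisted deferreds. A sketch, assuming MySpider1 and MySpider2 are spider classes defined like the one above:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before starting the next.
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # blocks until the last crawl finishes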
Only two of these, however, actually do what the title of this post describes. To summarize:
The first is simple, but cmdline.execute() itself ends with sys.exit(cmd.exitcode), so the program is guaranteed to exit as soon as the crawl finishes and nothing after the call ever runs. It is therefore only suitable for debugging.
from scrapy import cmdline
cmdline.execute("scrapy crawl SpiderName".split())
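To see the limitation concretely, a minimal sketch where the final print stands in for hypothetical post-processing:

from scrapy import cmdline

cmdline.execute("scrapy crawl SpiderName".split())
# Never reached: execute() ends with sys.exit(cmd.exitcode),
# so the interpreter terminates as soon as the crawl finishes.
print("post-processing...")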
The second takes a few more lines of code than the first, but it does not share that drawback, so it is the recommended approach.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# 'SpiderName' is the name of one of the spiders of the project;
# extra keyword arguments such as domain are passed to the spider.
process.crawl('SpiderName', domain='123.com')
process.start()  # the script will block here until the crawling is finished
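To confirm the difference from the first approach, a sketch in which the final print stands in for whatever follow-up work you need:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('SpiderName', domain='123.com')
process.start()

# Execution continues here after the crawl, which is exactly what
# cmdline.execute cannot offer. One caveat: start() cannot be called a
# second time, because the Twisted reactor is not restartable in one process.
print("crawl finished, running post-processing...")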