Running Multiple Scrapy Spiders in a Single Process

First, create the spiders themselves.

Modify the spider files so they look like this:

Python
# --- first spider file: spider 'seo2' ---
# -*- coding: utf-8 -*-
import scrapy


class Seo2Spider(scrapy.Spider):
    name = 'seo2'
    allowed_domains = ['www.168seo.cn']
    start_urls = ['http://www.168seo.cn/']

    def parse(self, response):
        print(response.css('title::text').extract_first())


# --- second spider file: spider 'seo' ---
# -*- coding: utf-8 -*-
import scrapy


class SeoSpider(scrapy.Spider):
    name = 'seo'
    allowed_domains = ['www.168seo.cn']
    start_urls = ['http://www.168seo.cn/']

    def parse(self, response):
        print(response.css('title::text').extract_first())

Next, create a new script named main.py.

The code of main.py is as follows:

Python
# -*- coding: utf-8 -*- """ @Time: 2018/1/22 @Author: songhao @微信公众号: zeropython @File: main.py """ import scrapy from scrapy.crawler import CrawlerProcess class SeoSpider(scrapy.Spider): name = 'seo' allowed_domains = ['www.168seo.cn'] start_urls = ['http://www.168seo.cn/'] def parse(self, response): print(response.css('title::text').extract_first()) class Seo2Spider(scrapy.Spider): name = 'seo2' allowed_domains = ['www.168seo.cn'] start_urls = ['http://www.168seo.cn/'] def parse(self, response): print(response.css('title::text').extract_first()) process = CrawlerProcess() process.crawl(SeoSpider) process.crawl(Seo2Spider) process.start() # the script will block here until all crawling jobs are finished
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# -*- coding: utf-8 -*-
"""
@Time: 2018/1/22
@Author: songhao
@WeChat official account: zeropython
@File: main.py
"""
import scrapy
from scrapy.crawler import CrawlerProcess


class SeoSpider(scrapy.Spider):
    name = 'seo'
    allowed_domains = ['www.168seo.cn']
    start_urls = ['http://www.168seo.cn/']

    def parse(self, response):
        print(response.css('title::text').extract_first())


class Seo2Spider(scrapy.Spider):
    name = 'seo2'
    allowed_domains = ['www.168seo.cn']
    start_urls = ['http://www.168seo.cn/']

    def parse(self, response):
        print(response.css('title::text').extract_first())


# schedule both spiders in the same process, then start the reactor
process = CrawlerProcess()
process.crawl(SeoSpider)
process.crawl(Seo2Spider)
process.start()  # the script will block here until all crawling jobs are finished

 

 

The result: both spiders run in the same process, and each prints its page title.
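
Note that a bare CrawlerProcess() does not load the settings.py of your Scrapy project, so pipelines and middleware configured there will not apply. Below is a minimal sketch of passing the project settings in explicitly, assuming main.py sits inside the Scrapy project (next to scrapy.cfg) so that get_project_settings() can find them; with project settings loaded, spiders can also be referred to by their names.

Python
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() reads the surrounding project's settings.py;
# this assumes the script is started from inside the Scrapy project.
process = CrawlerProcess(get_project_settings())
process.crawl('seo')    # with project settings loaded, the spider name string is enough
process.crawl('seo2')
process.start()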

Method 2:

Python
# -*- coding: utf-8 -*- """ @Time: 2018/1/22 @Author: songhao @微信公众号: zeropython @File: main.py """ import scrapy from twisted.internet import reactor from scrapy.crawler import CrawlerRunner from scrapy.utils.log import configure_logging # 爬虫1 class SeoSpider(scrapy.Spider): name = 'seo' allowed_domains = ['www.168seo.cn'] start_urls = ['http://www.168seo.cn/'] def parse(self, response): print(response.css('title::text').extract_first()) # 爬虫2 class Seo2Spider(scrapy.Spider): name = 'seo2' allowed_domains = ['www.168seo.cn'] start_urls = ['http://www.168seo.cn/'] def parse(self, response): print(response.css('title::text').extract_first()) configure_logging() runner = CrawlerRunner() runner.crawl(SeoSpider) runner.crawl(Seo2Spider) d = runner.join() d.addBoth(lambda _: reactor.stop()) reactor.run() # the script will block here until all crawling jobs are finished
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
# -*- coding: utf-8 -*-
"""
@Time: 2018/1/22
@Author: songhao
@WeChat official account: zeropython
@File: main.py
"""
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


# Spider 1
class SeoSpider(scrapy.Spider):
    name = 'seo'
    allowed_domains = ['www.168seo.cn']
    start_urls = ['http://www.168seo.cn/']

    def parse(self, response):
        print(response.css('title::text').extract_first())


# Spider 2
class Seo2Spider(scrapy.Spider):
    name = 'seo2'
    allowed_domains = ['www.168seo.cn']
    start_urls = ['http://www.168seo.cn/']

    def parse(self, response):
        print(response.css('title::text').extract_first())


configure_logging()
runner = CrawlerRunner()
runner.crawl(SeoSpider)
runner.crawl(Seo2Spider)
d = runner.join()                    # Deferred that fires when both crawls are done
d.addBoth(lambda _: reactor.stop())  # stop the reactor once they finish

reactor.run()  # the script will block here until all crawling jobs are finished

The result is roughly the same as with the first method.
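
If you need the spiders to run one after another instead of concurrently, the CrawlerRunner approach can be chained with Twisted's inlineCallbacks, following the sequential-run pattern from the Scrapy documentation. Below is a minimal sketch that reuses the SeoSpider and Seo2Spider classes defined above and replaces the last block of method 2's main.py.

Python
# -*- coding: utf-8 -*-
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl to finish before starting the next
    yield runner.crawl(SeoSpider)    # SeoSpider / Seo2Spider are the classes defined above
    yield runner.crawl(Seo2Spider)
    reactor.stop()


crawl()
reactor.run()  # blocks until reactor.stop() is called after both crawls finish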




  • zeropython (WeChat official account) · QQ: 5868037 · Email: 5868037@qq.com