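Advanced Python Crawling (6): Distributed System Design

This article describes the design of a distributed Python crawler system, covering the Master-Slave structure, protocol definitions, Socket communication, the heartbeat mechanism, and serialization. Heartbeats keep connections alive, the protocol governs communication, and Sockets carry the client-server interaction. It also discusses data consistency and batch-processing strategies, with the aim of improving crawl efficiency and system stability.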


1. Distributed Crawler System Design Diagram





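Main thread: checks status and creates tasks (for the crawler threads)

Crawler thread: crawls content

HeartBeat: maintains the connection and fetches commands (for the main thread to check)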
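
The three roles above can be sketched as follows. This is a minimal illustration rather than the article's actual implementation: the class name `CrawlerClient`, the `fetch` and `send_heartbeat` callables, and the polling intervals are all assumptions made for the example. Only the command strings come from the protocol constants in section 3.

```python
import queue
import threading
import time


class CrawlerClient:
    """Hypothetical client: main thread + crawler threads + heartbeat thread."""

    def __init__(self, fetch, send_heartbeat, num_workers=2, heartbeat_interval=0.05):
        self.fetch = fetch                    # assumed callable: url -> content
        self.send_heartbeat = send_heartbeat  # assumed callable: () -> command or None
        self.num_workers = num_workers
        self.heartbeat_interval = heartbeat_interval
        self.tasks = queue.Queue()            # tasks created by the main thread
        self.commands = queue.Queue()         # commands fetched by the heartbeat thread
        self.results = []
        self.lock = threading.Lock()
        self.paused = threading.Event()
        self.stopped = threading.Event()

    def _crawl_loop(self):
        """Crawler thread: take a task and crawl its content."""
        while not self.stopped.is_set():
            if self.paused.is_set():
                time.sleep(0.01)
                continue
            try:
                url = self.tasks.get(timeout=0.05)
            except queue.Empty:
                continue
            try:
                content = self.fetch(url)
            except Exception as exc:          # keep the thread alive on fetch errors
                content = exc
            with self.lock:
                self.results.append((url, content))

    def _heartbeat_loop(self):
        """Heartbeat thread: keep the connection alive and fetch commands."""
        while not self.stopped.is_set():
            command = self.send_heartbeat()
            if command is not None:
                self.commands.put(command)    # left for the main thread to check
            self.stopped.wait(self.heartbeat_interval)

    def _check_status(self):
        """Main thread: apply one command received by the heartbeat thread."""
        try:
            command = self.commands.get_nowait()
        except queue.Empty:
            return
        if command == 'PAUSE_REQUIRED':
            self.paused.set()
        elif command == 'RESUME_REQUIRED':
            self.paused.clear()
        elif command == 'SHUTDOWN_REQUIRED':
            self.stopped.set()

    def run(self, urls):
        """Main thread: create tasks, then check status until done or shut down."""
        urls = list(urls)
        for url in urls:
            self.tasks.put(url)
        threads = [threading.Thread(target=self._crawl_loop, daemon=True)
                   for _ in range(self.num_workers)]
        threads.append(threading.Thread(target=self._heartbeat_loop, daemon=True))
        for thread in threads:
            thread.start()
        while not self.stopped.is_set() and len(self.results) < len(urls):
            self._check_status()
            time.sleep(0.01)
        self.stopped.set()
        for thread in threads:
            thread.join()
        return list(self.results)
```

Calling `CrawlerClient(fetch, send_heartbeat).run(urls)` returns a list of `(url, content)` pairs. Routing commands through a queue keeps all state changes in the main thread, which matches the note that the heartbeat fetches commands for the main thread to check.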



2. Master-Slave Architecture





3. Protocol

Commonly used protocol constants:

protocol_constants.py


# msg type, could be REGISTER, UNREGISTER and HEARTBEAT
MSG_TYPE	= 'TYPE'

# send register
REGISTER 	= 'REGISTER'

# unregister client with id assigned by master
UNREGISTER 	= 'UNREGISTER'

# send heart beat to server with id
HEARTBEAT	= 'HEARTBEAT'

# notify master paused with id
PAUSED 		= 'PAUSED'

# notify master resumed with id
RESUMED		= 'RESUMED'

# notify master shutdown with id
SHUTDOWN		= 'SHUTDOWN'

# get a new location list to crawl
LOCATIONS		= 'REQUIRE_LOCATION_LIST'

# get a new triple list to crawl
TRIPLES  	= 'TRIPLES'

DATA = 'DATA'

CRAWL_DELAY = 'CRAWL_DELAY'

# finished list of item
FININSHED_ITEMS = 'FINISHED_ITEMS'

# client id key word
CLIENT_ID 	= 'CLIENT_ID'

# action required key word (server tells the client what to do)
ACTION_REQUIRED	= 'ACTION_REQUIRED'

# server require pause
PAUSE_REQUIRED	= 'PAUSE_REQUIRED'

# server require resume
RESUME_REQUIRED	= 'RESUME_REQUIRED'

# server require shutdown
SHUTDOWN_REQUIRED	= 'SHUTDOWN_REQUIRED'

# server status key word
SERVER_STATUS	= 'SERVER_STATUS'

# server status values
STATUS_RUNNING	= 'STATUS_RUNNING'

STATUS_PAUSED 	= 'STATUS_PAUSED'

STATUS_SHUTDOWN	= 'STATUS_SHUTDOWN'

STATUS_CONNECTION_LOST	= 'STATUS_CONNECTION_LOST'

ERROR 	= 'ERROR'

# client id not found, then it needs to register itself
ERR_NOT_FOUND	= 'ERR_NOT_FOUND'

REQUEST_SIZE    = 50
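
To show how these constants might be used, here is a hedged sketch of a master that handles REGISTER, HEARTBEAT and UNREGISTER messages. The `Master` class, the `encode`/`decode` helpers, the choice of JSON for serialization, and the `'UNSUPPORTED_TYPE'` error value are assumptions made for illustration and do not appear in the original code. The constants are repeated with the same values as in `protocol_constants.py` so that the block runs on its own.

```python
import json
import time

# Same values as in protocol_constants.py, repeated so the sketch is self-contained.
MSG_TYPE = 'TYPE'
REGISTER = 'REGISTER'
UNREGISTER = 'UNREGISTER'
HEARTBEAT = 'HEARTBEAT'
CLIENT_ID = 'CLIENT_ID'
SERVER_STATUS = 'SERVER_STATUS'
STATUS_RUNNING = 'STATUS_RUNNING'
ERROR = 'ERROR'
ERR_NOT_FOUND = 'ERR_NOT_FOUND'


def encode(message):
    """Serialize a message dict to bytes (JSON is an assumed wire format)."""
    return json.dumps(message).encode('utf-8')


def decode(data):
    """Deserialize bytes back into a message dict."""
    return json.loads(data.decode('utf-8'))


class Master:
    """Hypothetical master that tracks registered clients by id."""

    def __init__(self):
        self.clients = {}            # client id -> time of last heartbeat
        self.next_id = 1
        self.status = STATUS_RUNNING

    def handle(self, request):
        """Return the reply dict for one request dict."""
        msg_type = request.get(MSG_TYPE)
        if msg_type == REGISTER:
            client_id = str(self.next_id)
            self.next_id += 1
            self.clients[client_id] = time.time()
            return {MSG_TYPE: REGISTER, CLIENT_ID: client_id,
                    SERVER_STATUS: self.status}

        client_id = request.get(CLIENT_ID)
        if client_id not in self.clients:
            # Unknown id: the client needs to register itself again.
            return {MSG_TYPE: msg_type, ERROR: ERR_NOT_FOUND}

        if msg_type == UNREGISTER:
            del self.clients[client_id]
            return {MSG_TYPE: UNREGISTER, CLIENT_ID: client_id}
        if msg_type == HEARTBEAT:
            self.clients[client_id] = time.time()
            return {MSG_TYPE: HEARTBEAT, CLIENT_ID: client_id,
                    SERVER_STATUS: self.status}
        # 'UNSUPPORTED_TYPE' is a made-up value, not part of the original protocol.
        return {MSG_TYPE: msg_type, ERROR: 'UNSUPPORTED_TYPE'}
```

A client would first send `{MSG_TYPE: REGISTER}`, keep the returned `CLIENT_ID`, and include it in every later heartbeat. If a reply carries `ERR_NOT_FOUND`, the client registers again.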


4. Socket

For Socket communication, refer to this article.
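
As a rough illustration of how a client and the master might exchange one protocol message over a TCP socket, consider the sketch below. It is not taken from the referenced article. The function names, the one-request-per-connection pattern, and the newline-delimited JSON framing are all assumptions made for this example.

```python
import json
import socket
import threading


def send_message(sock, message):
    """Send one message dict as a single JSON line (assumed framing)."""
    sock.sendall(json.dumps(message).encode('utf-8') + b'\n')


def recv_message(reader):
    """Read one JSON line from a binary file object; None if the peer closed."""
    line = reader.readline()
    return json.loads(line) if line else None


def serve_once(server_sock, handler):
    """Server side: accept one connection, answer one request, then close."""
    conn, _addr = server_sock.accept()
    with conn, conn.makefile('rb') as reader:
        request = recv_message(reader)
        if request is not None:
            send_message(conn, handler(request))


def request_reply(address, message, timeout=5.0):
    """Client side: connect, send one request, and return the reply."""
    with socket.create_connection(address, timeout=timeout) as sock:
        send_message(sock, message)
        with sock.makefile('rb') as reader:
            return recv_message(reader)
```

To try it locally, bind a listener with `socket.create_server(('127.0.0.1', 0))`, run `serve_once` in a `threading.Thread`, and call `request_reply` with the address from `getsockname()`. A real master would loop over `accept()` and keep connections open for repeated heartbeats.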

