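1. Distributed Crawler System Design Diagram
Main thread: checks status and creates tasks (for the crawler threads)
Crawler thread: crawls content
HeartBeat: maintains the connection and fetches commands (for the main thread to check)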
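The division of work between the three roles can be sketched as follows. This is a minimal illustration, not the original implementation: the function `run_client`, the in-memory `commands` list that stands in for replies from the master, and the placeholder "fetch" are all assumptions made for the example.

```python
import queue
import threading
import time


def run_client(urls, commands, heartbeat_interval=0.01):
    """Hypothetical client: the main thread creates tasks, a crawler thread
    consumes them, and a heartbeat thread watches for a shutdown command."""
    tasks = queue.Queue()
    stop = threading.Event()
    crawled = []

    def crawler():
        # Keep going until shutdown was requested AND no tasks remain.
        while not (stop.is_set() and tasks.empty()):
            try:
                url = tasks.get(timeout=0.05)
            except queue.Empty:
                continue
            crawled.append('content of ' + url)  # placeholder for a real fetch

    def heartbeat():
        # A real heartbeat would contact the master; here the commands
        # are read from a pre-made list (an assumption for this sketch).
        for command in commands:
            if command == 'SHUTDOWN_REQUIRED':
                stop.set()
                return
            time.sleep(heartbeat_interval)
        stop.set()  # no more commands: stop as well

    # Main thread: create the tasks, start the workers, wait for them.
    for url in urls:
        tasks.put(url)
    threads = [threading.Thread(target=crawler),
               threading.Thread(target=heartbeat)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return crawled
```

Calling `run_client(['a', 'b'], [None, 'SHUTDOWN_REQUIRED'])` processes both tasks before the crawler thread exits, because the crawler only stops once the queue is empty.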
2. Master-Slave Architecture
3. Protocol
Commonly used protocol settings:
protocol_constants.py
# msg type, could be REGISTER, UNREGISTER and HEARTBEAT
MSG_TYPE = 'TYPE'
# send register
REGISTER = 'REGISTER'
# unregister client with id assigned by master
UNREGISTER = 'UNREGISTER'
# send heart beat to server with id
HEARTBEAT = 'HEARTBEAT'
# notify master paused with id
PAUSED = 'PAUSED'
# notify master resumed with id
RESUMED = 'RESUMED'
# notify master of shutdown with id
SHUTDOWN = 'SHUTDOWN'
# get a new location list to crawl
LOCATIONS = 'REQUIRE_LOCATION_LIST'
# get a new triple list to crawl
TRIPLES = 'TRIPLES'
DATA = 'DATA'
CRAWL_DELAY = 'CRAWL_DELAY'
# list of finished items
FININSHED_ITEMS = 'FINISHED_ITEMS'
# client id key word
CLIENT_ID = 'CLIENT_ID'
# action required key word
ACTION_REQUIRED = 'ACTION_REQUIRED'
# server requires pause
PAUSE_REQUIRED = 'PAUSE_REQUIRED'
# server requires resume
RESUME_REQUIRED = 'RESUME_REQUIRED'
# server requires shutdown
SHUTDOWN_REQUIRED = 'SHUTDOWN_REQUIRED'
# server status key word
SERVER_STATUS = 'SERVER_STATUS'
# server status values
STATUS_RUNNING = 'STATUS_RUNNING'
STATUS_PAUSED = 'STATUS_PAUSED'
STATUS_SHUTDOWN = 'STATUS_SHUTDOWN'
STATUS_CONNECTION_LOST = 'STATUS_CONNECTION_LOST'
ERROR = 'ERROR'
# client id not found, then it needs to register itself
ERR_NOT_FOUND = 'ERR_NOT_FOUND'
REQUEST_SIZE = 50
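To show how these constants might be combined into messages, here is a minimal sketch of a master handling REGISTER and HEARTBEAT requests. The JSON encoding, the `Master` class, and the counter-based client ids are assumptions for illustration; the source only defines the constants themselves.

```python
import json

# Subset of protocol_constants.py, repeated so the sketch runs on its own.
MSG_TYPE = 'TYPE'
REGISTER = 'REGISTER'
HEARTBEAT = 'HEARTBEAT'
CLIENT_ID = 'CLIENT_ID'
SERVER_STATUS = 'SERVER_STATUS'
STATUS_RUNNING = 'STATUS_RUNNING'
ERROR = 'ERROR'
ERR_NOT_FOUND = 'ERR_NOT_FOUND'


class Master:
    """Hypothetical master that tracks registered clients in memory."""

    def __init__(self):
        self.clients = set()
        self.next_id = 1

    def handle(self, raw):
        """Take one JSON-encoded request and return a JSON-encoded reply."""
        request = json.loads(raw)
        response = {SERVER_STATUS: STATUS_RUNNING}
        msg_type = request.get(MSG_TYPE)
        if msg_type == REGISTER:
            # Assign a new id and remember the client.
            client_id = str(self.next_id)
            self.next_id += 1
            self.clients.add(client_id)
            response[CLIENT_ID] = client_id
        elif msg_type == HEARTBEAT:
            client_id = request.get(CLIENT_ID)
            if client_id in self.clients:
                response[CLIENT_ID] = client_id
            else:
                # Unknown id: the client needs to register itself again.
                response[ERROR] = ERR_NOT_FOUND
        return json.dumps(response)
```

A client would first send `{MSG_TYPE: REGISTER}`, keep the returned `CLIENT_ID`, and include it in every later HEARTBEAT message; a reply carrying `ERR_NOT_FOUND` tells it to register again.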
4. Socket
For socket communication, refer to this article
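As a rough illustration of how a heartbeat message could travel over a TCP socket, the sketch below sends one JSON message to a local server and reads the reply. The helper names (`recv_all`, `serve_once`, `send_message`, `demo`), the JSON framing, and the use of end-of-stream to mark the end of a message are assumptions for this example and are not taken from the referenced article.

```python
import json
import socket
import threading


def recv_all(sock):
    """Read from the socket until the peer closes its sending side."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b''.join(chunks)


def serve_once(server_sock):
    """Accept one connection, read one JSON message, send one JSON reply."""
    conn, _ = server_sock.accept()
    with conn:
        request = json.loads(recv_all(conn).decode('utf-8'))
        reply = {'SERVER_STATUS': 'STATUS_RUNNING',
                 'CLIENT_ID': request.get('CLIENT_ID')}
        conn.sendall(json.dumps(reply).encode('utf-8'))


def send_message(address, message):
    """Send one JSON message and return the decoded JSON reply."""
    with socket.create_connection(address, timeout=5) as sock:
        sock.sendall(json.dumps(message).encode('utf-8'))
        sock.shutdown(socket.SHUT_WR)  # tell the server the request is complete
        return json.loads(recv_all(sock).decode('utf-8'))


def demo():
    # Port 0 lets the operating system pick a free port.
    server = socket.create_server(('127.0.0.1', 0))
    address = server.getsockname()
    worker = threading.Thread(target=serve_once, args=(server,))
    worker.start()
    reply = send_message(address, {'TYPE': 'HEARTBEAT', 'CLIENT_ID': '1'})
    worker.join()
    server.close()
    return reply
```

Opening one connection per message keeps the framing trivial; a long-lived connection would need an explicit delimiter or length prefix instead.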