Distributed processes means spreading Process instances across multiple machines, so that the combined power of several machines can be used to complete complex tasks; this idea can be applied to the development of distributed crawlers.
Here is a simple example: the service process puts tasks into task_queue and exposes it through a registered interface; the worker process calls the same interface, executes the tasks, and writes the results into result_queue.
taskManager.py: the service process
from multiprocessing.managers import BaseManager
from multiprocessing import freeze_support
import queue

# number of tasks
task_number = 10

# queues for sending tasks and collecting results
task_queue = queue.Queue(task_number)
result_queue = queue.Queue(task_number)

def get_task():
    return task_queue

def get_result():
    return result_queue

# create a QueueManager subclass of BaseManager
class QueueManager(BaseManager):
    pass

def win_run():
    # expose the two queues on the network; the callables must be module-level functions
    QueueManager.register('get_task_queue', callable=get_task)
    QueueManager.register('get_result_queue', callable=get_result)
    # bind the manager to port 8001; authkey must be bytes in Python 3
    manager = QueueManager(address=('127.0.0.1', 8001), authkey=b'qiye')
    manager.start()
    try:
        # obtain the queue proxies through the manager
        task = manager.get_task_queue()
        result = manager.get_result_queue()
        # put the tasks into task_queue
        for url in ["ImageUrl_" + str(i) for i in range(10)]:
            print('put task %s....' % url)
            task.put(url)
        print('try get result....')
        # wait for the workers to write results into result_queue
        for i in range(10):
            print('result is %s' % result.get(timeout=10))
    except:
        print('Manager error')
    finally:
        manager.shutdown()

if __name__ == '__main__':
    # freeze_support() is needed when freezing the script into a Windows executable
    freeze_support()
    win_run()
taskWorker.py: the worker process
import time
from multiprocessing.managers import BaseManager

# the same QueueManager subclass as on the server
class QueueManager(BaseManager):
    pass

# register by name only; the implementations live on the server
QueueManager.register('get_task_queue')
QueueManager.register('get_result_queue')

# connect to the server; the authkey must match the server's and be bytes in Python 3
server_addr = '127.0.0.1'
print('Connect to server %s..' % server_addr)
m = QueueManager(address=(server_addr, 8001), authkey=b'qiye')
m.connect()

# obtain the queue proxies from the manager
task = m.get_task_queue()
result = m.get_result_queue()

# take tasks from task_queue, process them, and write results into result_queue
while not task.empty():
    image_url = task.get(True, timeout=5)
    print('run task download %s ....' % image_url)
    time.sleep(1)
    result.put('%s----->success' % image_url)
print('work exit.')
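A caveat on this loop: when several worker processes pull from the same queue, task.empty() can race, because another worker may drain the queue between the check and the get(). A more defensive loop (just a sketch, not part of the original code) catches the timeout instead of checking empty():

import queue  # for the queue.Empty exception

while True:
    try:
        # block up to 5 seconds waiting for a task
        image_url = task.get(timeout=5)
    except queue.Empty:
        # no task arrived in time; assume the task queue is exhausted
        break
    print('run task download %s ....' % image_url)
    time.sleep(1)
    result.put('%s----->success' % image_url)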
Run the service process first; the tasks are put into task_queue:
put task ImageUrl_0....
put task ImageUrl_1....
put task ImageUrl_2....
put task ImageUrl_3....
put task ImageUrl_4....
put task ImageUrl_5....
put task ImageUrl_6....
put task ImageUrl_7....
put task ImageUrl_8....
put task ImageUrl_9....
try get result....
While the service process is still running, start the worker process:
Connect to server 127.0.0.1..
run task download ImageUrl_0 ....
run task download ImageUrl_1 ....
run task download ImageUrl_2 ....
run task download ImageUrl_3 ....
run task download ImageUrl_4 ....
run task download ImageUrl_5 ....
run task download ImageUrl_6 ....
run task download ImageUrl_7 ....
run task download ImageUrl_8 ....
run task download ImageUrl_9 ....
work exit.
After the worker process finishes, you can see on the service side that the results have been written into result_queue:
result is ImageUrl_0----->success
result is ImageUrl_1----->success
result is ImageUrl_2----->success
result is ImageUrl_3----->success
result is ImageUrl_4----->success
result is ImageUrl_5----->success
result is ImageUrl_6----->success
result is ImageUrl_7----->success
result is ImageUrl_8----->success
result is ImageUrl_9----->success
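The example above runs both processes on 127.0.0.1 for simplicity. To actually spread the workers across machines, only the addresses need to change: bind the manager on the server to all interfaces and point each worker at the server's IP. A minimal sketch, assuming the server's LAN IP is 192.168.1.100 (a placeholder) and that port 8001 is reachable from the worker machines:

# taskManager.py on the server machine: bind to all interfaces instead of 127.0.0.1
manager = QueueManager(address=('', 8001), authkey=b'qiye')

# taskWorker.py on each worker machine: connect to the server's real address
server_addr = '192.168.1.100'  # placeholder; substitute your server's IP
m = QueueManager(address=(server_addr, 8001), authkey=b'qiye')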
This article walked through a simple example of distributed process management: by handing tasks out to separate processes and collecting their results, it shows how the performance of multiple machines can be used to speed up complex work. It covered the interaction between the service process and the worker process, with code implemented using Python's multiprocessing module.
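To turn this skeleton into the distributed crawler mentioned at the start, the worker's task loop would actually download each URL instead of just sleeping. A minimal sketch, assuming the task queue holds real image URLs and that the requests library is installed (download_image is a hypothetical helper, not part of the original code):

import requests

def download_image(image_url, timeout=10):
    # download one image to the current directory and return a status string for result_queue
    try:
        resp = requests.get(image_url, timeout=timeout)
        resp.raise_for_status()
        filename = image_url.rsplit('/', 1)[-1] or 'unnamed.jpg'
        with open(filename, 'wb') as f:
            f.write(resp.content)
        return '%s----->success' % image_url
    except requests.RequestException as e:
        return '%s----->failed: %s' % (image_url, e)

# inside the worker loop, replace time.sleep(1) and the fixed result.put(...) with:
#     result.put(download_image(image_url))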