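In web crawler development, processes and threads are important concepts. These are my notes on the study material I found.
1. Multiple processes: creating processes with the multiprocessing module
The multiprocessing module provides a Process class that describes a process object. To create a child process, pass in a target function and its arguments to build a Process instance, start it with start(), and call join() to wait for it to finish, which is how processes are synchronized.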
import os
from multiprocessing import Process

# Code executed by each child process
def run_proc(name):
    print(f'Child process {name}: {os.getpid()}')

if __name__ == '__main__':
    print(f'Parent process {os.getpid()}')
    processes = []
    for i in range(5):
        p = Process(target=run_proc, args=(str(i),))
        print('Process will start')
        p.start()
        print(f'the childProcess {i} is running')
        processes.append(p)
    # Wait for every child, not only the last one created
    for p in processes:
        p.join()
    print('Process end')
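2. The multiprocessing module provides a Pool class that represents a process pool
A Pool offers a fixed number of worker processes for the caller to use. The default size is the number of CPU cores, and a different size can be specified. When a new request is submitted to the Pool and a worker is free, the request runs right away. If every process in the pool is busy, the request waits until one of them finishes its current work.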
from multiprocessing import Pool
import os, time, random

def run_task(name):
    print(f'Task {name}, {os.getpid()} is running...')
    time.sleep(random.random() * 3)
    print(f'Task {name} end.')

if __name__ == "__main__":
    print(f'the mainProcess is {os.getpid()}')
    run_pool = Pool(processes=2)
    for i in range(5):
        run_pool.apply_async(run_task, args=(i,))
        print(f'Pool is start :{i}')
    run_pool.close()   # no more tasks can be submitted
    run_pool.join()    # wait for all worker processes to finish
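Note: calling join() on a Pool object waits for all child processes to finish. close() must be called before join(), and once close() has been called no new tasks can be added to the pool.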
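The example above uses apply_async and ignores whatever the task returns. When the results are needed, Pool.map is one option. The following is only a minimal sketch: the names square and square_all are made up for illustration and are not part of the original notes.

```python
from multiprocessing import Pool


def square(n):
    """Hypothetical CPU-bound task: return n squared."""
    return n * n


def square_all(numbers, workers=2):
    """Run square() over numbers in a pool of `workers` processes."""
    # map() blocks until every result is ready and keeps the input order.
    # Leaving the with-block shuts the pool down.
    with Pool(processes=workers) as pool:
        return pool.map(square, numbers)


if __name__ == "__main__":
    print(square_all(range(5)))  # [0, 1, 4, 9, 16]
```

Because map() only returns once all results are available, there is no need for a separate close() and join() pair in this form.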
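3. Inter-process communication: Queue and Pipe
Queue is a queue that is safe to share between processes. It has two main methods: put inserts an item into the queue, and get reads one item from the queue and removes it.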
from multiprocessing import Queue, Process
import random, time, os

def process_write(q, urls):
    '''write in queue'''
    print(f'the write queue id is {os.getpid()}')
    for url in urls:
        q.put(url)
        print(f'the url is {url}')
        time.sleep(random.random())

def process_read(q):
    '''read in queue'''
    print(f'the read is {os.getpid()}')
    while True:
        url = q.get(True)   # block until an item is available
        print(f'the read url is {url}')

if __name__ == "__main__":
    q = Queue()
    write_process1 = Process(target=process_write, args=(q, ['url1', 'url2', 'url3']))
    write_process2 = Process(target=process_write, args=(q, ['url4', 'url5', 'url6']))
    read_process = Process(target=process_read, args=(q,))
    write_process1.start()
    write_process2.start()
    read_process.start()
    write_process1.join()
    write_process2.join()
    # process_read loops forever, so it has to be stopped explicitly
    read_process.terminate()
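The reader above runs an endless loop, so the parent has to stop it with terminate(). One possible alternative is to put a marker value on the queue once the writers are done, so the reader can exit on its own. This is a sketch under assumptions: SENTINEL, producer and consumer are illustrative names, and using None as the marker is a convention, not something the Queue API defines.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # assumed "no more items" marker; not part of the Queue API


def producer(q, urls):
    """Put every URL on the queue."""
    for url in urls:
        q.put(url)


def consumer(q, results):
    """Read until the sentinel arrives, then hand back what was seen."""
    seen = []
    while True:
        url = q.get()            # blocks until an item is available
        if url is SENTINEL:
            break
        seen.append(url)
    results.put(seen)


if __name__ == "__main__":
    q, results = Queue(), Queue()
    writer = Process(target=producer, args=(q, ["url1", "url2", "url3"]))
    reader = Process(target=consumer, args=(q, results))
    writer.start()
    reader.start()
    writer.join()                # all URLs are queued once the writer exits
    q.put(SENTINEL)              # tell the reader to stop
    print(results.get())         # fetch the result before joining the reader
    reader.join()
```

With more than one reader, each reader would need its own sentinel.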
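4. The Pipe communication mechanism
A Pipe is commonly used for communication between two processes, with one process at each end of the pipe.
The Pipe function returns (conn1, conn2), which represent the two ends of a pipe. It takes a duplex parameter: if duplex is True (the default), the pipe is full duplex and both ends can send and receive. If it is False, conn1 can only receive and conn2 can only send. The send and recv methods send and receive respectively.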
import multiprocessing
import random, os, time

def proc_send(pip, urls):
    for url in urls:
        pip.send(url)
        print(f'the Process {os.getpid()} send url is {url}')
        time.sleep(random.random())
    pip.send(None)   # signal that nothing more will be sent

def proc_recv(pip):
    while True:
        url = pip.recv()
        if url is None:   # the sender is done; without this check recv() blocks forever
            break
        print(f'the Process {os.getpid()},{url}')
        time.sleep(random.random())

if __name__ == "__main__":
    print(f'the main process is {os.getpid()}')
    pip = multiprocessing.Pipe()
    process_send = multiprocessing.Process(target=proc_send, args=(pip[0], ['url_' + str(i) for i in range(10)]))
    process_recv = multiprocessing.Process(target=proc_recv, args=(pip[1],))
    process_send.start()
    process_recv.start()
    process_send.join()
    process_recv.join()
    print('the main process is over')
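To see what duplex=False does, the two ends can be tried out inside a single process. This is a small sketch: one_way_demo is a made-up helper name, and the code only shows that the first connection receives and rejects send().

```python
from multiprocessing import Pipe


def one_way_demo():
    """Send one message through a one-way pipe, then try to send the wrong way."""
    # With duplex=False the first end only receives and the second only sends.
    recv_end, send_end = Pipe(duplex=False)

    send_end.send("url_0")
    received = recv_end.recv()

    try:
        recv_end.send("url_1")   # sending on the receive-only end
        send_rejected = False
    except OSError:
        send_rejected = True

    recv_end.close()
    send_end.close()
    return received, send_rejected


if __name__ == "__main__":
    print(one_way_demo())
```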
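5. Multithreading
Use cases: moving long-running tasks into the background, and handling tasks that spend most of their time waiting, such as sending and receiving data over the network.
There are two ways to create threads. The first passes a function to the Thread constructor to create a Thread instance and then calls start().
The second subclasses threading.Thread directly and overrides the __init__ and run methods.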
import threading
import os, time, random

def threading_run(urls):
    print(f'the threading name is {threading.current_thread().name}---{os.getpid()}')
    for url in urls:
        print(f'the thread {threading.current_thread().name} is {url}')
        time.sleep(random.random())
    print(f'the threading end {threading.current_thread().name}')

t1 = threading.Thread(target=threading_run, name='t1', args=(['url_1', 'url_2', 'url_3'],))
t2 = threading.Thread(target=threading_run, name='t2', args=(['url_4', 'url_5', 'url_6'],))
t1.start()
t2.start()
t1.join()   # wait for both threads before the main thread continues
t2.join()
The second way, subclassing threading.Thread:
import threading
import time, random

class MyThread(threading.Thread):
    def __init__(self, name, urls) -> None:
        super().__init__(name=name)
        self.urls = urls

    def run(self):
        # run() is what start() executes in the new thread
        print(f'the thread name is {threading.current_thread().name}')
        for url in self.urls:
            print(f'the threading name is {threading.current_thread().name},the url is {url}')
            time.sleep(random.random())
        print(f'the threading {threading.current_thread().name} is ended')

t1 = MyThread(name='t1', urls=['url_1', 'url_2', 'url_3'])
t2 = MyThread(name='t2', urls=['url_4', 'url_5', 'url_6'])
t1.start()
t2.start()
t1.join()
t2.join()
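The use case noted at the start of this section, tasks that mostly wait, can be checked with a timing experiment. In the sketch below, fake_download and download_all are hypothetical names, and time.sleep stands in for a real network request. If the waits overlap, three 0.3 second "downloads" should take roughly 0.3 seconds in total rather than 0.9.

```python
import threading
import time


def fake_download(url, delay, results):
    """Stand-in for a network request: wait, then record the URL."""
    time.sleep(delay)        # the thread is idle here, as it would be on I/O
    results.append(url)      # appending to a shared list is safe in CPython


def download_all(urls, delay=0.3):
    """Start one thread per URL and return (results, elapsed seconds)."""
    results = []
    threads = [
        threading.Thread(target=fake_download, args=(url, delay, results))
        for url in urls
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start


if __name__ == "__main__":
    done, elapsed = download_all(["url_1", "url_2", "url_3"])
    print(f"{len(done)} downloads in {elapsed:.2f}s")
```

The order of the results depends on which thread finishes first, so it should not be relied on.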
References:
1. Python's Gevent: a high-performance Python concurrency framework (CSDN blog)