Process pool internals: data structures and helper threads

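This article takes a close look at the Pool class in Python's multiprocessing library: its internal structure and how it works, including how tasks are dispatched and how results are collected. It also compares the use cases of Pool.apply and Pool.apply_async.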

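The Pool class gives the caller a fixed number of worker processes. In CPython the workers are started when the pool is created. A submitted request is handed to an idle worker if there is one; if every worker is busy, the request waits in a queue until a worker finishes its current task and can pick it up.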
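As a small illustration of the point above, the sketch below submits more tasks than there are workers and records which process ran each one. The helper name worker_pids is made up for this example, and os.getpid is used as the task only because it is a standard-library callable that reports where it ran.

```python
import os
from multiprocessing import Pool


def worker_pids(num_workers, num_tasks):
    """Run num_tasks trivial tasks and return the set of worker PIDs that ran them.

    worker_pids is a hypothetical helper written for this illustration.
    """
    with Pool(processes=num_workers) as pool:
        # os.getpid executes inside a worker, so every result is a worker PID.
        pending = [pool.apply_async(os.getpid) for _ in range(num_tasks)]
        return {result.get(timeout=30) for result in pending}


if __name__ == "__main__":
    pids = worker_pids(num_workers=2, num_tasks=8)
    print(len(pids), "distinct worker PIDs served 8 tasks")
```

With two workers and eight tasks, the set should never contain more than two PIDs: the extra tasks wait their turn rather than getting processes of their own.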

The pool object returned by Pool() holds the following data structures:

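self._inqueue  Task-receiving queue (SimpleQueue), used by the main process to send tasks to the worker processes
self._outqueue  Result queue (SimpleQueue), used by the worker processes to send results back to the main process
self._taskqueue  Thread-safe task queue that holds tasks submitted in the main process until they are forwarded to self._inqueue
self._cache = {}  Task cache, mapping each pending job to the result object that will receive its value
self._processes  Number of worker processes
self._pool = []  List of worker processes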
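To see these attributes on a live pool, a quick inspection along the following lines can help. Note that every underscore-prefixed name here is a private CPython implementation detail: it is read only for exploration, and the concrete types may differ between Python versions. The function name describe_pool is hypothetical.

```python
from multiprocessing import Pool


def describe_pool(num_workers):
    """Return a summary of a pool's private attributes (exploration only)."""
    with Pool(processes=num_workers) as pool:
        return {
            # Type names are reported rather than assumed, since they vary by version.
            "inqueue_type": type(pool._inqueue).__name__,
            "outqueue_type": type(pool._outqueue).__name__,
            "taskqueue_type": type(pool._taskqueue).__name__,
            "cache_is_dict": isinstance(pool._cache, dict),
            "processes": pool._processes,
            "pool_size": len(pool._pool),
        }


if __name__ == "__main__":
    for key, value in describe_pool(3).items():
        print(key, "=", value)
```

This is only meant for poking around in an interpreter; production code should stick to the public Pool methods.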

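While the pool is running, receiving tasks, dispatching them, and returning results are all handled cooperatively by several threads inside the pool. Let's look at which threads those are: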

The _worker_handler thread makes sure that when a worker process exits, a new one is created and added to the process list (_pool), so that the pool always holds `processes` workers. Its target function is Pool._handle_workers, which, while the pool state is RUN, repeatedly calls _maintain_pool to detect exited workers and append freshly created replacements to _pool.

The _task_handler thread takes tasks out of the pool's _taskqueue and puts them into the task-receiving queue _inqueue (backed by a Pipe).

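The _result_handler thread (target function Pool._handle_results) reads finished task results from _outqueue (also backed by a Pipe) and stores them in the task cache _cache.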

_terminate: note that this is not a thread but a Finalize object, which runs the pool's cleanup routine when the pool is terminated.
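The helper threads described above can be observed with a sketch like the one below. It reads private attributes of CPython's Pool, so treat it as exploratory and version-dependent. The second function uses the public maxtasksperchild parameter to make each worker exit after one task, which forces the pool to start replacements. Both function names are hypothetical.

```python
import os
import threading
from multiprocessing import Pool
from multiprocessing.util import Finalize


def describe_helpers():
    """Report on the pool's helper threads and its _terminate attribute."""
    with Pool(processes=2) as pool:
        names = ("_worker_handler", "_task_handler", "_result_handler")
        helpers = [getattr(pool, name) for name in names]  # private attributes
        return {
            "all_threads": all(isinstance(h, threading.Thread) for h in helpers),
            "all_alive": all(h.is_alive() for h in helpers),
            "terminate_is_finalize": isinstance(pool._terminate, Finalize),
        }


def pids_with_recycling(num_workers, num_tasks):
    """Each worker exits after one task, so replacement workers must be started."""
    with Pool(processes=num_workers, maxtasksperchild=1) as pool:
        pending = [pool.apply_async(os.getpid) for _ in range(num_tasks)]
        return {result.get(timeout=60) for result in pending}


if __name__ == "__main__":
    print(describe_helpers())
    print(len(pids_with_recycling(2, 6)), "distinct PIDs from a pool of 2 workers")
```

Seeing more distinct PIDs than the pool size is the visible effect of worker replacement: exited workers are swapped for new processes while the pool size stays constant.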

[Figure: the pool's data structures and how its threads cooperate]
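The flow of a task through these queues can be mimicked with a toy model. This is not how Pool is implemented: it uses threads and queue.Queue in a single process in place of worker processes and pipes, stores plain values in the cache instead of result objects, and every name in it (run_toy_pool and the local variables) is invented for the sketch.

```python
import queue
import threading


def run_toy_pool(jobs):
    """Push (func, args) jobs through a taskqueue -> inqueue -> outqueue -> cache chain."""
    taskqueue = queue.Queue()  # plays the role of _taskqueue
    inqueue = queue.Queue()    # plays the role of _inqueue
    outqueue = queue.Queue()   # plays the role of _outqueue
    cache = {}                 # plays the role of _cache (job id -> value)

    def task_handler():
        # Forward tasks to the worker; None is the shutdown sentinel.
        while True:
            task = taskqueue.get()
            inqueue.put(task)
            if task is None:
                break

    def worker():
        # Execute each task and publish (job_id, value).
        while True:
            task = inqueue.get()
            if task is None:
                outqueue.put(None)
                break
            job_id, func, args = task
            outqueue.put((job_id, func(*args)))

    def result_handler():
        # Store each finished result under its job id.
        while True:
            item = outqueue.get()
            if item is None:
                break
            job_id, value = item
            cache[job_id] = value

    threads = [threading.Thread(target=t) for t in (task_handler, worker, result_handler)]
    for thread in threads:
        thread.start()
    for job_id, (func, args) in enumerate(jobs):
        taskqueue.put((job_id, func, args))
    taskqueue.put(None)
    for thread in threads:
        thread.join()
    return cache


if __name__ == "__main__":
    print(run_toy_pool([(pow, (2, 3)), (abs, (-5,))]))
```

The sentinel value travels down the same chain as the tasks, so each stage shuts down only after it has handled everything submitted before it.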

The multiprocessing.Pool class tries to provide an interface similar to Python 2's built-in apply function.

Pool.apply is like Python's built-in apply, except that the function call is performed in a separate process. Pool.apply blocks until the function is completed.

Pool.apply_async is also like Python's built-in apply, except that the call returns immediately instead of waiting for the result. An ApplyResult object is returned. You call its get() method to retrieve the result of the function call. The get() method blocks until the function is completed. Thus, pool.apply(func, args, kwargs) is equivalent to pool.apply_async(func, args, kwargs).get().
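The equivalence stated above can be checked with a minimal sketch. The function name compare_apply is made up, and the built-ins pow and int are used as tasks purely for illustration.

```python
from multiprocessing import Pool


def compare_apply():
    """Call the same function through apply and through apply_async(...).get()."""
    with Pool(processes=2) as pool:
        blocking = pool.apply(pow, (2, 10))            # blocks until done
        async_result = pool.apply_async(pow, (2, 10))  # returns immediately
        via_get = async_result.get(timeout=30)         # blocks here instead
        # Keyword arguments are passed as a dict in the third position.
        with_kwargs = pool.apply(int, ("ff",), {"base": 16})
        return blocking, via_get, with_kwargs


if __name__ == "__main__":
    print(compare_apply())  # (1024, 1024, 255)
```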

In contrast to Pool.apply, the Pool.apply_async method also has a callback which, if supplied, is called when the function is complete. This can be used instead of calling get().
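A callback can be sketched roughly as follows. The example also passes error_callback, which apply_async accepts in Python 3 for calls that raise; the name run_with_callbacks and the choice of math.sqrt as the task are assumptions made for the illustration.

```python
import math
from multiprocessing import Pool


def run_with_callbacks():
    """Collect results and errors through callbacks instead of calling get()."""
    values, errors = [], []
    pool = Pool(processes=2)
    # callback receives the return value of a successful call.
    pool.apply_async(math.sqrt, (16,), callback=values.append)
    # error_callback receives the exception instance of a failed call.
    pool.apply_async(math.sqrt, (-1,), error_callback=errors.append)
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait until the workers and handler threads have finished
    return values, errors


if __name__ == "__main__":
    values, errors = run_with_callbacks()
    print(values, [type(e).__name__ for e in errors])
```

Callbacks run in the parent process, so they should return quickly; a slow callback delays the handling of other results.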

If you want the Pool of worker processes to perform many function calls asynchronously, use Pool.apply_async. The order of the results is not guaranteed to be the same as the order of the calls to Pool.apply_async.

Notice also that you could call a number of different functions with Pool.apply_async (not all calls need to use the same function).
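As a minimal sketch of submitting different functions to one pool, the example below keeps the returned result objects in a list. Tasks may finish in any order, but calling get() on the objects in submission order yields the values in that order. The function name run_mixed_calls is hypothetical.

```python
from multiprocessing import Pool


def run_mixed_calls():
    """Submit three different functions and collect results in submission order."""
    calls = [(pow, (2, 5)), (abs, (-3,)), (max, (1, 7, 4))]
    with Pool(processes=3) as pool:
        pending = [pool.apply_async(func, args) for func, args in calls]
        # Each get() blocks until that particular call has completed.
        return [result.get(timeout=30) for result in pending]


if __name__ == "__main__":
    print(run_mixed_calls())  # [32, 3, 7]
```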

In contrast, Pool.map applies the same function to many arguments. However, unlike Pool.apply_async, the results are returned in an order corresponding to the order of the arguments.
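The ordering guarantee of Pool.map looks like this in a small sketch (map_in_order is a made-up name, and math.factorial is just a convenient standard-library task):

```python
import math
from multiprocessing import Pool


def map_in_order(values):
    """Apply one function to many arguments; results line up with the inputs."""
    with Pool(processes=2) as pool:
        return pool.map(math.factorial, values)


if __name__ == "__main__":
    print(map_in_order([5, 3, 1]))  # [120, 6, 1]
```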

map_async is recommended for three reasons:

It's cleaner looking code. This:

pool = Pool(processes=proc_num)
async_result = pool.map_async(post_processing_0.main, split_list)
pool.close()
pool.join()

looks nicer than this:

pool = Pool(processes=proc_num)
P = {}
for i in range(proc_num):
    P['process_' + str(i)] = pool.apply_async(post_processing_0.main, [split_list[i]])
pool.close()
pool.join()

With apply_async, if an exception occurs inside post_processing_0.main, you won't know about it unless you explicitly call P['process_x'].get() on the failing AsyncResult object, which would require iterating over all of P. With map_async the exception will be raised if you call async_result.get(), with no iteration required.

map_async has built-in chunking functionality, which will make your code perform noticeably better if split_list is very large.
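The last two points can be tried out with a sketch like the following. It substitutes math.sqrt and abs for post_processing_0.main, which is not available here, and the three function names are hypothetical. The chunksize value is arbitrary and not a tuning recommendation.

```python
import math
from multiprocessing import Pool


def map_async_failure():
    """A single get() on the map_async result re-raises the worker's exception."""
    with Pool(processes=2) as pool:
        async_result = pool.map_async(math.sqrt, [4, -1, 9])
        try:
            async_result.get(timeout=30)
        except ValueError as exc:
            return type(exc).__name__
    return None


def apply_async_failures():
    """With apply_async, every AsyncResult has to be checked individually."""
    with Pool(processes=2) as pool:
        pending = [pool.apply_async(math.sqrt, (x,)) for x in [4, -1, 9]]
        failed = []
        for index, result in enumerate(pending):
            try:
                result.get(timeout=30)
            except ValueError:
                failed.append(index)
        return failed


def map_async_chunked():
    """chunksize groups the inputs so each task sent to a worker covers many items."""
    with Pool(processes=2) as pool:
        async_result = pool.map_async(abs, range(-1000, 0), chunksize=100)
        return async_result.get(timeout=30)


if __name__ == "__main__":
    print(map_async_failure())     # ValueError
    print(apply_async_failures())  # [1]
    print(len(map_async_chunked()), "results")
```

Whether a larger chunksize actually helps depends on how cheap each call is compared with the cost of sending a task to a worker, so it is worth measuring on real data.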