A Beginner's Guide to Multithreading and Multiprocessing in Python: When to Use a Thread Pool and When to Use a Process Pool

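Quick start with Python concurrent.futures

concurrent.futures is the standard-library module for concurrent programming in Python. It provides a high-level interface for running callables asynchronously in a pool of threads or processes.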

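1. Basic concepts

Executor: the executor classes, ThreadPoolExecutor (thread pool) and ProcessPoolExecutor (process pool), are the core of the module. They manage the lifecycle of the worker threads or processes.
Submitting tasks: use submit() to submit a single task, or map() to submit a batch.
Future: an object that represents an asynchronous computation and is used to query its state or fetch its result. Call result() to get the result (this blocks until the task is done), or pass a collection of futures to as_completed() to iterate over them as they finish.
Choosing a pool: threads suit I/O-bound work, processes suit CPU-bound work.
- Thread pool: I/O-bound tasks (such as network requests or file reads and writes). While one thread waits on I/O, the other threads can run.
- Process pool: CPU-bound tasks (such as numerical computation). Separate processes are not constrained by Python's GIL (global interpreter lock), so they can use multiple CPU cores.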

Installation and import: concurrent.futures is part of the standard library, so nothing needs to be installed.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
import time  # used for timing

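2. Basic usage patterns

Operation	Method / description
Create a thread or process pool	`ThreadPoolExecutor(max_workers=N)` or `ProcessPoolExecutor(max_workers=N)`
Submit a single task	`executor.submit(func, *args, **kwargs)` → returns a Future object
Submit a batch of tasks	`executor.map(func, iterable)` → returns an iterator that yields results in input order (a task's exception is raised when its result is retrieved from the iterator, and no later results can be retrieved after that)
Get a result	`future.result(timeout=None)` → blocks until the task finishes (optional timeout)
Iterate over finished tasks	`as_completed(futures)` → returns an iterator that yields Futures in completion order
Cancel a task that has not started	`future.cancel()` → can succeed only if the task has not started running (returns a bool)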
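
Two rows of the table are easier to see in action: where an exception from map() actually surfaces, and when cancel() can succeed. The following is a minimal sketch, not a definitive pattern; parse_int, collect_with_map and cancel_queued_task are hypothetical names introduced only for this illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def parse_int(text):
    """Hypothetical task: raises ValueError for non-numeric input."""
    return int(text)


def collect_with_map(values):
    """Consume executor.map() and report where the first failure surfaces."""
    collected = []
    with ThreadPoolExecutor(max_workers=2) as executor:
        results = executor.map(parse_int, values)
        try:
            for value in results:
                collected.append(value)
        except ValueError as exc:
            # The exception is raised here, while iterating, not when map() is called.
            return collected, exc
    return collected, None


def cancel_queued_task():
    """Show that cancel() succeeds only for a task that has not started."""
    release = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as executor:
        running = executor.submit(release.wait)   # occupies the only worker
        queued = executor.submit(parse_int, "7")  # has to wait in the queue
        cancelled_queued = queued.cancel()        # True: it never started
        release.set()                             # let the first task finish
        running.result()
        cancelled_done = running.cancel()         # False: it already finished
    return cancelled_queued, cancelled_done


if __name__ == "__main__":
    print(collect_with_map(["1", "2", "x", "4"]))
    print(cancel_queued_task())
```

If every result matters even when some tasks fail, submit() combined with as_completed() is usually the easier route, because each Future carries its own exception.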

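3. Examples

Differences between ProcessPoolExecutor and ThreadPoolExecutor

Feature	ProcessPoolExecutor	ThreadPoolExecutor
Execution model	Multiple processes	Multiple threads
Memory	Each process has its own memory space	Threads share one memory space
GIL impact	Not constrained by the GIL; true parallelism	Constrained by the GIL; effective for I/O-bound work
Startup overhead	Higher (a new process must be created)	Lower (threads are cheaper to create)
Data sharing	Requires an IPC mechanism	Data can be shared directly (but needs synchronization)
Best suited for	CPU-bound tasks	I/O-bound tasks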
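
The "Data sharing" row can be illustrated with a small sketch: several pool threads update one object directly, and a lock provides the synchronization. SharedCounter and count_in_threads are hypothetical names used only for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedCounter:
    """A counter that several threads may update at the same time."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, times):
        for _ in range(times):
            with self._lock:  # makes the read-modify-write step atomic
                self.value += 1


def count_in_threads(workers=4, times=10_000):
    """Let every worker thread increment the same counter object."""
    counter = SharedCounter()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(counter.add, times) for _ in range(workers)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return counter.value


if __name__ == "__main__":
    print(count_in_threads())  # 40000: all threads updated the same object
```

The same approach does not carry over to ProcessPoolExecutor: each worker process has its own memory, so data has to travel as task arguments and return values or through an IPC mechanism.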

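Recommendations by task

1. Batch-copying and renaming CSV files

This is a typical I/O-bound operation; most of the time is spent reading and writing files. For this kind of task:

Recommended: ThreadPoolExecutor
Reason: threads are cheap to create, and the GIL is released during I/O operations, which lets other threads run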
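
As a rough sketch of this first task, the code below copies every CSV file in one directory to another and renames it by adding a prefix. The helper names, the "copy_" prefix and the flat (non-recursive) directory layout are assumptions made for the illustration, not part of the original task description.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def copy_with_prefix(source, target_dir, prefix):
    """Copy one file into target_dir, prepending prefix to its name."""
    target = Path(target_dir) / f"{prefix}{Path(source).name}"
    shutil.copy2(source, target)
    return target


def copy_csvs(source_dir, target_dir, prefix="copy_", max_workers=8):
    """Copy every *.csv in source_dir to target_dir using a thread pool."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    sources = sorted(Path(source_dir).glob("*.csv"))
    copied = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_source = {
            executor.submit(copy_with_prefix, src, target_dir, prefix): src
            for src in sources
        }
        for future in as_completed(future_to_source):
            try:
                copied.append(future.result())
            except OSError as exc:
                print(f"Failed to copy {future_to_source[future]}: {exc}")
    return sorted(copied)
```

Calling copy_csvs("data/in", "data/out") (placeholder paths) returns the list of files that were written, sorted by name.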

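2. Reading CSV files into DataFrames, processing them, and saving the results

This task combines I/O (reading the CSV) with CPU work (processing the DataFrame):

Recommended: ProcessPoolExecutor
Reason: DataFrame processing in pandas is largely CPU-bound; multiple processes are not constrained by the GIL and can make real use of multiple cores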

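Choosing a sensible number of worker threads/processes

I/O-bound: more threads than cores is reasonable (for example, 2-4 times the number of CPU cores)
CPU-bound: usually the number of CPU cores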
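
These rules of thumb can be written down as a small helper. This is only a sketch: suggest_workers is a hypothetical function, and the 4x multiplier is simply the upper end of the 2-4x guideline above, not a measured optimum.

```python
import os


def suggest_workers(kind, n_tasks, cpu_count=None):
    """Return a rule-of-thumb worker count; kind is "io" or "cpu"."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    cores = cpu_count or os.cpu_count() or 1
    if kind == "io":
        workers = cores * 4   # upper end of the 2-4x guideline
    elif kind == "cpu":
        workers = cores       # one worker per core
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    return min(workers, n_tasks)  # never more workers than tasks


if __name__ == "__main__":
    print(suggest_workers("io", n_tasks=100))
    print(suggest_workers("cpu", n_tasks=100))
```

The cpu_count parameter exists only so the helper can be exercised with a fixed core count; in normal use it falls back to os.cpu_count().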

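I/O-bound task

import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_url(url):
    """Simulate a network request: wait 1 second, then return a string for the URL."""
    time.sleep(1)  # simulate waiting on I/O
    return f"Data from {url}"

# Main program
if __name__ == "__main__":
    urls = ["https://example.com", "https://google.com", "https://github.com"]

    # 1. Create the thread pool (since Python 3.8 the default size is
    #    min(32, os.cpu_count() + 4); pass max_workers to override it)
    with ThreadPoolExecutor(max_workers=3) as executor:
        # 2. Submit the tasks (option 1: one at a time, each call returns a Future)
        futures = [executor.submit(fetch_url, url) for url in urls]

        # 3. Collect the results (option 1: iterate in completion order)
        for future in as_completed(futures):
            try:
                result = future.result()  # returns at once, the future is already done
                print(result)
            except Exception as e:
                print(f"Task failed: {e}")

        # (Optional) option 2: submit a batch and get results in input order with map()
        results = executor.map(fetch_url, urls)  # returns an iterator, in the order of urls
        print(list(results))  # Output: ["Data from ...", ...]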

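CPU-bound task

from concurrent.futures import ProcessPoolExecutor, as_completed

def calculate_factorial(n):
    """Compute n! with a plain loop (CPU-bound)."""
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result

if __name__ == "__main__":
    numbers = [1000, 2000, 3000, 4000]

    # Use a process pool (not constrained by the GIL, so it can use multiple cores)
    with ProcessPoolExecutor(max_workers=4) as executor:
        # Submit the tasks and collect the results as they finish
        futures = [executor.submit(calculate_factorial, num) for num in numbers]
        for future in as_completed(futures):
            # Print the size rather than the value: since Python 3.11, converting an
            # int with more than 4300 digits to str raises ValueError by default
            print(f"Factorial result has {future.result().bit_length()} bits")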
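
For many small CPU-bound tasks, sending each one to a worker individually can cost more than the work itself. ProcessPoolExecutor.map() accepts a chunksize argument that sends the tasks to the workers in batches. The sketch below assumes a hypothetical prime-counting workload; is_prime and count_primes are illustrative names, and the chunk size of 1000 is an arbitrary starting point rather than a tuned value.

```python
import math
from concurrent.futures import ProcessPoolExecutor


def is_prime(n):
    """Trial division; defined at module top level so it can be pickled."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True


def count_primes(limit, max_workers=None, chunksize=1000):
    """Count the primes below limit, sending the numbers to workers in chunks."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        flags = executor.map(is_prime, range(limit), chunksize=chunksize)
        return sum(flags)


if __name__ == "__main__":
    print(count_primes(100_000))
```

With ThreadPoolExecutor the chunksize argument has no effect, so this tuning knob only matters for process pools.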

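Key points to note:

Passing arguments:
- A function run by ProcessPoolExecutor can take a single tuple argument and unpack it inside the function, instead of taking several separate parameters; this keeps each task a single object when it is passed to submit() or map()
- Whatever form they take, the arguments must be serializable with pickle, because they are sent to another process. The same applies to return values, and the function itself must be defined at module top level (lambdas and nested functions cannot be pickled)
Number of processes:
- Use os.cpu_count() to get the number of CPU cores rather than hard-coding the worker count
- For CPU-bound tasks (such as pandas data processing), the worker count is usually set to the number of CPU cores
Platform notes (Windows):
- On Windows, the code that creates the process pool must be placed inside an if __name__ == "__main__": block. The same holds on any platform that uses the spawn start method (the default on macOS since Python 3.8)
- Each child process re-imports the main module, so without this guard the children would recursively try to create new processes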
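
The pickling requirement can be checked before anything is handed to a process pool. The following is a small sketch under that assumption; scale and is_picklable are hypothetical helpers, and the worker follows the single-tuple convention described above.

```python
import pickle


def scale(task):
    """Top-level worker that unpacks a single (value, factor) tuple."""
    value, factor = task
    return value * factor


def is_picklable(obj):
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print(is_picklable(scale))          # a top-level function
    print(is_picklable((3, 2.5)))       # a tuple of plain values
    print(is_picklable(lambda x: x))    # a lambda cannot be pickled
```

Running a check like this on the function and on one sample task tends to give a clearer error than the one raised later inside the pool.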

4. Timing and templates

Use time.perf_counter() (a high-resolution timer) to measure the total run time of the concurrent tasks:

import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_data(a, b):
    time.sleep(0.5)
    return f"Result: {a}-{b}"

if __name__ == "__main__":
    start_time = time.perf_counter()  # record the start time

    with ThreadPoolExecutor(max_workers=3) as executor:
        # Submit 5 tasks (each with several arguments)
        futures = [
            executor.submit(process_data, i, f"str_{i}")
            for i in range(5)
        ]
        # Wait for all tasks to finish (optionally iterating over the results)
        for future in as_completed(futures):
            print(future.result())

    end_time = time.perf_counter()  # record the end time
    total_time = end_time - start_time
    # About 1 second: 5 tasks on 3 workers run in two rounds of 0.5 s
    # (running them one after another would take about 2.5 s)
    print(f"Total time: {total_time:.2f} seconds")
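
To see what the thread pool buys, it helps to time the same workload both ways. This is a minimal sketch with a simulated 0.2-second task; slow_square, time_call, run_sequential and run_threaded are hypothetical names, and the exact timings will vary from machine to machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x, delay=0.2):
    """Simulated I/O-bound task: sleep, then return x squared."""
    time.sleep(delay)
    return x * x


def time_call(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def run_sequential(values):
    return [slow_square(v) for v in values]


def run_threaded(values):
    # One thread per task, so all the sleeps overlap
    with ThreadPoolExecutor(max_workers=max(1, len(values))) as executor:
        return list(executor.map(slow_square, values))


if __name__ == "__main__":
    values = [1, 2, 3, 4]
    _, sequential_time = time_call(run_sequential, values)
    _, threaded_time = time_call(run_threaded, values)
    print(f"sequential: {sequential_time:.2f} s, threaded: {threaded_time:.2f} s")
```

Because the timer wraps the whole with block, the threaded figure also includes the time needed to start and shut down the pool.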

A general-purpose thread-pool template with a progress bar:

import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Iterable, List, Optional, Union

from tqdm import tqdm  # third-party package: pip install tqdm

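def run_parallel_tasks(
    process_func: Callable,
    tasks: Iterable[Any],
    max_workers: Optional[int] = None,
    task_description: str = "Processing",
    unit: str = "task",
    return_results: bool = False,
    show_progress: bool = True
) -> Union[List[Any], None]:
    """
    General-purpose template for running tasks in a thread pool.

    Args:
        process_func: the function that processes one task
        tasks: the task arguments; each element is passed to process_func
        max_workers: maximum number of threads; None (default) picks a value automatically
        task_description: description text for the progress bar
        unit: unit label for the progress bar
        return_results: whether to return the results
        show_progress: whether to show a progress bar

    Returns:
        If return_results is True, a list with the result of every task,
        in completion order (None for a task that failed).
        Otherwise None.
    """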
    # Materialize the tasks: len() is needed below, and a generator has no length
    tasks = list(tasks)
    if not tasks:
        return [] if return_results else None

    # Choose the thread count automatically
    if max_workers is None:
        cpu_count = os.cpu_count() or 1
        # For I/O-bound tasks, 2-4x the number of CPU cores is a common choice
        max_workers = min(max(4, cpu_count * 2), 20, len(tasks))

    results = [] if return_results else None

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks
        future_to_task = {
            executor.submit(process_func, task): task
            for task in tasks
        }

        # Set up the progress bar
        if show_progress:
            pbar = tqdm(
                total=len(tasks),
                desc=task_description,
                unit=unit,
                bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]"
            )

        # Handle the tasks as they complete
        for future in as_completed(future_to_task):
            task = future_to_task[future]

            try:
                result = future.result()
                if return_results:
                    results.append(result)
            except Exception as e:
                logging.error(f"Task failed: {task}, error: {e}")
                if return_results:
                    results.append(None)  # alternatively, store the exception: results.append(e)
            finally:
                if show_progress:
                    pbar.update(1)

        if show_progress:
            pbar.close()

    return results
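
Because as_completed() yields futures in completion order, a results list built with append() does not line up with the input. When the caller needs results in input order, one option is to remember each task's index. The sketch below is a simplified variant written for this illustration (no progress bar or logging); run_ordered and sleep_then_echo are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_ordered(func, tasks, max_workers=4):
    """Run func over tasks in a thread pool and return results in input order.

    A failed task leaves its exception object in the corresponding slot.
    """
    tasks = list(tasks)
    results = [None] * len(tasks)
    if not tasks:
        return results
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_index = {
            executor.submit(func, task): index
            for index, task in enumerate(tasks)
        }
        for future in as_completed(future_to_index):
            index = future_to_index[future]
            try:
                results[index] = future.result()
            except Exception as exc:
                results[index] = exc
    return results


def sleep_then_echo(seconds):
    """Simulated task whose duration equals its argument."""
    time.sleep(seconds)
    return seconds


if __name__ == "__main__":
    # The 0.1 s task finishes first, but the output keeps the input order
    print(run_ordered(sleep_then_echo, [0.3, 0.1, 0.2]))  # [0.3, 0.1, 0.2]
```

Storing the exception rather than None keeps a failed task distinguishable from a task that legitimately returned None.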

import os
import pandas as pd
from concurrent.futures import ProcessPoolExecutor, as_completed
import multiprocessing
from pathlib import Path

# Global configuration
col_identifier = 'CC'  # example value
new_idx = 1  # example value

def process_single_file(file_info):
    """
    Process a single CSV file with reduced memory use.
    """
    source_file, target_file = file_info

    try:
        # Read the CSV in a memory-friendly way
        # Option 1: specify dtypes to reduce memory use
        dtypes = {
            'col1': 'category',  # adjust to the actual columns
            'col2': 'float32',
            # other columns...
        }

        # Option 2: read only the columns that are needed
        usecols = ['col1', 'col2', 'col3']  # adjust as needed

        # Option 3: process large files in chunks
        chunk_size = 10000  # rows per chunk; adjust to the available memory

        # Pick a strategy based on the file size
        file_size = os.path.getsize(source_file) / (1024 * 1024)  # MB

        if file_size > 500:  # large file: process in chunks
            chunks = []
            for chunk in pd.read_csv(source_file, chunksize=chunk_size, usecols=usecols, dtype=dtypes):
                # Process the data in each chunk
                # chunk = your_processing_function(chunk)
                chunks.append(chunk)

            # Combine all chunks
            df = pd.concat(chunks, ignore_index=True)
        else:
            # Small file: read it in one go
            df = pd.read_csv(source_file, usecols=usecols, dtype=dtypes)

        # Data-processing logic
        # df = your_data_processing(df)

        # Save the result
        df.to_csv(target_file, index=False)

        return target_file
    except Exception as e:
        print(f"Error while processing file {source_file}: {e}")
        return None

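def process_data_files_parallel(source_root, target_root):
    """
    Process the data files in parallel, one file per task.
    """
    max_workers = 4

    # Build the (source_file, target_file) task list. Mirroring the layout of
    # source_root under target_root is an assumption; adjust it to your own
    # naming scheme.
    source_root, target_root = Path(source_root), Path(target_root)
    file_tasks = []
    for source_file in sorted(source_root.rglob("*.csv")):
        target_file = target_root / source_file.relative_to(source_root)
        target_file.parent.mkdir(parents=True, exist_ok=True)
        file_tasks.append((str(source_file), str(target_file)))

    processed_files = []

    # Process the files in parallel with a process pool
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks
        future_to_file = {
            executor.submit(process_single_file, task): task
            for task in file_tasks
        }

        # Collect the results
        for future in as_completed(future_to_file):
            try:
                # The future is already done here, so a timeout on result()
                # would never trigger
                result = future.result()
                if result:
                    processed_files.append(result)
                    print(f"Finished: {result}")
            except Exception as exc:
                file_task = future_to_file[future]
                print(f"Processing file {file_task[0]} raised an exception: {exc}")

    return processed_files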

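def main():
    # Configure the paths
    SOURCE_ROOT = "/path/to/your/source/root"
    TARGET_ROOT = "/path/to/your/target/root"

    print("Starting to process large CSV files...")

    # Process the files
    processed_files = process_data_files_parallel(SOURCE_ROOT, TARGET_ROOT)

    print(f"Done! Processed {len(processed_files)} files in total")

if __name__ == "__main__":
    # The __main__ guard is required when using multiprocessing on Windows;
    # freeze_support() only matters when the script is frozen into a Windows executable
    multiprocessing.freeze_support()
    main()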