An Analysis of the MapReduce Design Pattern in PocketFlow

PocketFlow: Minimalist LLM Framework in 100 Lines. Enable LLMs to Program Themselves. Project repository: https://gitcode.com/gh_mirrors/poc/PocketFlow

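What Is the MapReduce Design Pattern?

MapReduce is a classic distributed computing model that is well suited to processing large datasets. PocketFlow adopts it as a task-processing design pattern, used mainly in two situations:

Large inputs (for example, many files to process)
Large outputs (for example, several different forms of output to generate)

The core idea is to break a complex task into smaller, more manageable subtasks and then aggregate their results.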
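
To make the split-then-aggregate idea concrete before any framework is involved, here is a minimal sketch in plain Python. The function names `map_phase` and `reduce_phase` and the sample file contents are illustrative assumptions, not part of PocketFlow.

```python
from functools import reduce

def map_phase(files):
    """Map: handle each file independently, producing one small result per file."""
    return [(name, len(text.split())) for name, text in files.items()]

def reduce_phase(per_file_counts):
    """Reduce: fold the per-file results into a single aggregate value."""
    return reduce(lambda total, item: total + item[1], per_file_counts, 0)

# Hypothetical input: two tiny "files" held in memory.
files = {
    "a.txt": "one two three",
    "b.txt": "four five",
}

per_file = map_phase(files)          # one (filename, word_count) pair per file
total_words = reduce_phase(per_file)  # a single number for the whole input

print(per_file)
print(total_words)
```

Each map result depends on only one file, which is what lets the subtasks be handled one at a time; only the reduce step needs to see all of the results.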

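How MapReduce Works

The MapReduce pattern has two main phases:

Map phase: a BatchNode splits the large task into many independent small tasks
Reduce phase: the results of the Map phase are aggregated

This divide-and-conquer approach makes large workloads easier to process and gives the code a clearer structure.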
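
The two phases can be sketched with a simplified stand-in for a batch node. `MiniBatchNode` below is a hypothetical class written only for illustration; it mirrors the prep / exec / post method names used later in this article but is not PocketFlow's actual implementation, which may differ in details such as retries and action routing.

```python
class MiniBatchNode:
    """Hypothetical, simplified stand-in for a batch node (not PocketFlow's code)."""

    def prep(self, shared):
        raise NotImplementedError

    def exec(self, item):
        raise NotImplementedError

    def post(self, shared, prep_res, exec_res_list):
        raise NotImplementedError

    def run(self, shared):
        items = self.prep(shared)                        # split the task into items
        results = [self.exec(item) for item in items]    # Map: one exec call per item
        return self.post(shared, items, results)         # collect all results


class SquareAll(MiniBatchNode):
    """Map phase: square each number independently."""

    def prep(self, shared):
        return shared["numbers"]

    def exec(self, n):
        return n * n

    def post(self, shared, prep_res, exec_res_list):
        shared["squares"] = exec_res_list


def sum_squares(shared):
    """Reduce phase: aggregate the Map results into one value."""
    shared["total"] = sum(shared["squares"])


shared = {"numbers": [1, 2, 3]}
SquareAll().run(shared)
sum_squares(shared)
print(shared)
```

The point of the sketch is the shape of the lifecycle: `prep` returns an iterable, `exec` sees one item at a time, and `post` receives the full list of results, which the reduce step then consumes.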

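Practical Example: Document Summarization

Let's look at a document-summarization example to see how MapReduce is applied in PocketFlow.

Scenario

Suppose we have several text files and need to:

Generate a separate summary for each file
Merge all of the file summaries into one final, combined summary

Code Walkthrough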

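from pocketflow import BatchNode, Flow, Node

# call_llm is assumed to be a user-supplied helper that sends a prompt to an
# LLM and returns the response text; it is not defined in this article.

class SummarizeAllFiles(BatchNode):
    def prep(self, shared):
        files_dict = shared["files"]  # all file data, as {filename: content}
        return list(files_dict.items())  # list of (filename, content) pairs to iterate over

    def exec(self, one_file):
        filename, file_content = one_file
        # Call the LLM to summarize a single file
        summary_text = call_llm(f"Summarize the following file:\n{file_content}")
        return (filename, summary_text)

    def post(self, shared, prep_res, exec_res_list):
        # Store every file's summary in the shared dict
        shared["file_summaries"] = dict(exec_res_list)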

class CombineSummaries(Node):
    def prep(self, shared):
        return shared["file_summaries"]

    def exec(self, file_summaries):
        # Format all of the per-file summaries into one block of text
        text_list = []
        for fname, summ in file_summaries.items():
            text_list.append(f"{fname} summary:\n{summ}\n")
        big_text = "\n---\n".join(text_list)

        # Call the LLM to produce the final combined summary
        return call_llm(f"Combine these file summaries into one final summary:\n{big_text}")

    def post(self, shared, prep_res, final_summary):
        shared["all_files_summary"] = final_summary

# Build the flow: Map node first, then the Reduce node
batch_node = SummarizeAllFiles()
combine_node = CombineSummaries()
batch_node >> combine_node

flow = Flow(start=batch_node)

# Prepare the input data
shared = {
    "files": {
        "file1.txt": "Alice was beginning to get very tired of sitting by her sister...",
        "file2.txt": "Some other interesting text ...",
        # more files...
    }
}

# Run the flow
flow.run(shared)

# Print the results
print("Individual Summaries:", shared["file_summaries"])
print("\nFinal Summary:\n", shared["all_files_summary"])

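Code Structure

SummarizeAllFiles class (Map phase):
- prep method: prepares the list of files to process, as (filename, content) pairs
- exec method: processes a single file and generates its summary
- post method: collects the summaries of all files into shared["file_summaries"]
CombineSummaries class (Reduce phase):
- Receives the summaries of all files
- Generates the final combined summary
Flow construction:
- Connects the two nodes into a complete flow
- Runs the flow and prints the results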
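
To check the data flow without a real model, the same two steps can be traced as plain functions with a stub in place of `call_llm`. This is only a sketch under stated assumptions: `fake_call_llm`, `summarize_files`, and `combine_summaries` are hypothetical names introduced here, and the code deliberately avoids importing PocketFlow so that it runs on its own.

```python
calls = []  # records every prompt the stub receives

def fake_call_llm(prompt):
    """Hypothetical stand-in for call_llm: logs the prompt, returns a canned reply."""
    calls.append(prompt)
    return f"[stub reply #{len(calls)}]"

def summarize_files(files, llm):
    """Map step: one LLM call per file, mirroring SummarizeAllFiles.exec."""
    return {
        name: llm(f"Summarize the following file:\n{content}")
        for name, content in files.items()
    }

def combine_summaries(file_summaries, llm):
    """Reduce step: one LLM call over all summaries, mirroring CombineSummaries.exec."""
    parts = [f"{fname} summary:\n{summ}\n" for fname, summ in file_summaries.items()]
    big_text = "\n---\n".join(parts)
    return llm(f"Combine these file summaries into one final summary:\n{big_text}")

shared = {
    "files": {
        "file1.txt": "Alice was beginning to get very tired of sitting by her sister...",
        "file2.txt": "Some other interesting text ...",
    }
}
shared["file_summaries"] = summarize_files(shared["files"], fake_call_llm)
shared["all_files_summary"] = combine_summaries(shared["file_summaries"], fake_call_llm)

print(len(calls))
print(shared["file_summaries"])
print(shared["all_files_summary"])
```

With N input files the stub is called N + 1 times: once per file in the Map step and once in the Reduce step. The final prompt contains every per-file summary, so its length grows with the number of files.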