Map-Reduce: Shuffle and Sort

This article walks through the Hadoop MapReduce shuffle and sort workflow: how the mapper buffers its writes, how data is partitioned and sorted, the roles of the partitioner and combiner, how the reducers copy and merge the map output, and the memory management and disk operations involved along the way, ending with the final reduce phase.

Introduction

MapReduce guarantees that the input to every reducer is sorted by key. The process by which the map output is sorted and transferred across to the reducers is known as the shuffle.

The following figure (taken from Hadoop: The Definitive Guide) illustrates the shuffle and sort phase:




Buffering Map Writes

The mapper does not write directly to disk; instead it buffers its writes. Each map task has a circular memory buffer, 100 MB by default, which can be tuned via the io.sort.mb property. The flush is handled in a smart manner: when the buffer fills up to a certain threshold (80% by default, configurable through the io.sort.spill.percent property), a separate thread is triggered that spills the buffer contents to disk.
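As a rough illustration, the sketch below shows how these two knobs might be set on a job's Configuration. The io.sort.mb and io.sort.spill.percent names are the classic ones used above; newer Hadoop releases renamed them (mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent), and the 200 MB / 90% values and the class and job names are arbitrary examples, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SortBufferTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Enlarge the map-side sort buffer from the 100 MB default to 200 MB.
        conf.setInt("io.sort.mb", 200);

        // Start spilling when the buffer is 90% full instead of the 80% default.
        conf.setFloat("io.sort.spill.percent", 0.90f);

        Job job = Job.getInstance(conf, "shuffle-tuning-example");
        // ... set mapper, reducer, input/output paths as usual ...
    }
}
```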

Role of the Partitioner & Combiner

Before the spill is written to disk, the spill thread first partitions the data according to the reducer each record must ultimately go to, and a background thread then performs an in-memory sort by key within each partition. If a combiner is configured, it consumes the output of this in-memory sort, reducing the amount of data written to disk. Several spill files may be generated over the course of the map task, so at the end of the map phase an on-disk merge combines them into larger partitions (larger in size and fewer in number, the number depending on the number of reducers); the sort order is preserved during the merge.
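To make the partitioning contract concrete, here is a minimal sketch of a custom partitioner that behaves like Hadoop's default HashPartitioner: every record with a given key is routed to the same reduce partition. The class name KeyHashPartitioner and the Text/IntWritable types are illustrative assumptions, not part of the original article.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * A hash-based partitioner: records with the same key always land in the
 * same reduce partition, which is what the shuffle relies on.
 */
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

In the driver this would be registered with job.setPartitionerClass(KeyHashPartitioner.class); a combiner is wired in the same way with job.setCombinerClass(...), often reusing the reducer class when the reduce function is commutative and associative.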

Copy phase to the Reducer

Now, the output of the various map tasks sits on different nodes and needs to be copied over to the node on which the reduce task will run so that it can consume them. If a piece of map output is small enough to fit into the reduce task's memory, it is merged in memory with the other sorted map outputs copied so far; as soon as a threshold is reached, the merged output is written to disk, and the process is repeated until every map task's output for this reducer's partition has been accounted for. The on-disk files are then merged in groups, and the final group of files is fed directly into the reducer via an in-memory merge while feeding, thus saving an extra trip to the disk.
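The thresholds mentioned above are configurable. The sketch below lists the classic (Hadoop 1.x era) property names that govern the reduce-side buffer and merges, with typical default values; newer releases expose equivalents under the mapreduce.reduce.shuffle.* prefix, and the helper class and method names here are purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;

// A minimal sketch of the reduce-side shuffle knobs, using the classic
// (Hadoop 1.x era) property names discussed in the text.
public class ReduceShuffleTuning {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();

        // Fraction of the reduce task's heap used to buffer incoming map output.
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);

        // Start merging from memory to disk once the buffer is this full ...
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
        // ... or once this many map outputs have accumulated in memory.
        conf.setInt("mapred.inmem.merge.threshold", 1000);

        // How many streams are merged at once during the on-disk merges.
        conf.setInt("io.sort.factor", 10);

        return conf;
    }
}
```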

Final Step: The Reduce Phase

From the final merge (a mixture of in-memory and on-disk merging), the data is fed to the reduce phase, which may optionally perform further processing; the resulting output is then written to HDFS.
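For completeness, here is a minimal, word-count style reducer sketch (the SumReducer name and the Text/IntWritable types are assumptions for illustration): the framework delivers each key with its values already grouped and sorted by the shuffle described above, and whatever the reduce function writes to the context becomes the job's output on HDFS.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each key; the output records are written to the
// job's output path on HDFS by the framework.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}
```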