MapReduce编程:最大值、最小值、平均值、计数、中位数、标准差

博客介绍了如何使用MapReduce进行最大值、最小值、平均值、计数等统计分析,特别是针对Stack Overflow评论数据。通过transformXmlToMap函数解析XML文件,Mapper和Reducer实现计算,包括最大最小值类的核心逻辑,并讨论MapReduce的内在排序机制。此外,还提及了计算平均值和标准差的方法,以及中位数的朴素实现。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

MapReduce编程最基础的范例应该就是Wordcount了,然后大部分就是要做一遍最大值最小值的计算。课上老师用的课本是《MapReduce编程与设计模式》,里面第一章就介绍了Wordcount ,接下来就是最大值最小值平均值标准差,其数据来源于Stack Overflow网站上的评论内容,包括评论时间、评论用户ID,评论文本。并且是以.xml文件形式做输入文件。因此读入到mapper时需要先将xml转化为map的键值对形式。transformXmlToMap(value.toString());
以下是输入文件的形式,随便造的几组数据,只改动了评论时间与用户ID,评论文本内容是直接粘的。

<row Id="1" PostId="35314" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2018-09-06T08:07:10.730" UserId="1" /> 
<row Id="1" PostId="35315" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2007-09-06T08:05:33.730" UserId="1" />
<row Id="1" PostId="35316" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-09-06T08:07:10.730" UserId="1" /> 
<row Id="1" PostId="35317" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-08-06T08:07:26.730" UserId="1" /> 
<row Id="2" PostId="35318" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-05-06T08:11:10.730" UserId="1" /> 
<row Id="2" PostId="35319" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-09-06T08:12:10.730" UserId="1" /> 
<row Id="2" PostId="35320" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-06-06T08:03:10.730" UserId="1" /> 
<row Id="2" PostId="35321" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-09-06T08:07:10.880" UserId="1" /> 
<row Id="2" PostId="35322" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2016-09-06T08:07:39.730" UserId="1" /> 
<row Id="2" PostId="35323" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2008-03-06T08:07:10.730" UserId="1" /> 
<row Id="3" PostId="35324" Score="39" Text="not sure why this is getting downvoted -- it is correct! Double check it in your compiler if you don't believe him!" CreationDate="2007-09-06T08:00:22.730" UserId="1" /> 

这在课本上是没有看到这个函数的内部实现的,但是仍是一个基本的工具类,可以自己实现,目的就是将文本抠出来转换成map形式存储。

public static final String[] REDIS_INSTANCES = {
    "p0", "p1", "p2", "p3",
			"p4", "p6" };

	// This helper function parses the stackoverflow into a Map for us.
	public static Map<String, String> transformXmlToMap(String xml) {
   
		Map<String, String> map = new HashMap<String, String>();
		try {
   
			String[] tokens = xml.trim().substring(5, xml.trim().length() - 3)
					.split("\"");

			for (int i = 0; i < tokens.length - 1; i += 2) {
   
				String key = tokens[i].trim();
				String val = tokens[i + 1];

				map.put(key.substring(0, key.length() - 1), val);
			}
		} catch (StringIndexOutOfBoundsException e) {
   
			System.err.println(xml);
		}

		return map;
	}

然后接下来就是一个最大最小值类:
该代码完全取自课本原文,但接下来的平均值和标准差的计算,课本限于篇幅就没有打出完整代码,只写出了核心的mapper部分与reducer部分。但是仍可根据最大最小值的范例写出模板化的主类部分。

首先可以看到计算最大最小值类中存在三个域值,可以知道该代码处理的是一个用户多个评论中的最早评论时间【最小值】与最晚评论时间【最大值】,并且还包括一个count计算用户总量。
并且值得注意的是其中重载了toString函数,作为输出格式。

其次,MapReduce本身在运行过程中自带排序,无论用户目的是否需要排序,其都会在mapper过程中自动排序,因为排序对大多数处理都是有利的,这也是MapReduce本身的机制。

package mapreduce_2019;
import java.io.DataInput;
import java.io.DataOutput;
import java.io
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值