My First Map/Reduce Program

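This article describes a way to count page views (PV) and unique visitors (UV) per package name with the Hadoop MapReduce framework. The statistics take two MapReduce jobs: the first counts records keyed by user device ID and package name, and the second computes the PV and UV of each package name.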


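1. A problem I ran into earlier: WordCount ran fine under cygwin but kept failing in Eclipse
Reference: http://sunjun041640.blog.163.com/blog/static/25626832201061751825292/ which already explains it in detail.

Setting the value in Eclipse's Advanced parameters still had no effect for me, so I solved it in code instead: conf.set("mapred.child.tmp", "d:/hadoop/temp");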

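2. My first task at the new job

1. Description: compute PV and UV statistics per package name

2. Data:

UDID PACK IP (column names I made up; the actual file has no header row)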

1 oschina.net ***

2 baidu.com ***

2 oschina.net ***

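PV is the number of page views and UV is the number of unique visitors. So PV is the total number of times each package was visited, and UV is the same count after removing duplicate visitors (UDIDs).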
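As a reference point for these two definitions, here is a minimal plain-Java sketch that computes PV and UV in memory, without Hadoop. The class name `PvUvDirect`, the method `pvUv`, and the extra repeated record for UDID 1 are assumptions made for illustration; they are not part of the original data or code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class PvUvDirect {

    /** Returns pack -> {PV, UV} for records given as {udid, pack} pairs. */
    public static Map<String, int[]> pvUv(List<String[]> records) {
        Map<String, Integer> views = new TreeMap<>();
        Map<String, Set<String>> visitors = new HashMap<>();
        for (String[] r : records) {
            String udid = r[0];
            String pack = r[1];
            views.merge(pack, 1, Integer::sum);                        // every record is one view
            visitors.computeIfAbsent(pack, k -> new HashSet<>()).add(udid); // the set removes duplicates
        }
        Map<String, int[]> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : views.entrySet()) {
            int uv = visitors.get(e.getKey()).size();
            result.put(e.getKey(), new int[] { e.getValue(), uv });
        }
        return result;
    }

    public static void main(String[] args) {
        // The three sample records plus one hypothetical repeat visit by UDID 1.
        List<String[]> records = Arrays.asList(
                new String[] { "1", "oschina.net" },
                new String[] { "1", "oschina.net" },
                new String[] { "2", "baidu.com" },
                new String[] { "2", "oschina.net" });
        pvUv(records).forEach((pack, c) ->
                System.out.println(pack + " PV=" + c[0] + " UV=" + c[1]));
    }
}
```

With the repeated record, oschina.net has PV 3 but UV 2, which is the gap between the two metrics that the MapReduce design has to reproduce.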

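3. Algorithm design

Use UDID-PACK as the key of the first Map and 1 as the value. Assuming UDID 1 visits oschina.net twice (the sample above lists that record only once), the first Map outputs (1-oschina.net, 1) (1-oschina.net, 1) (2-baidu.com, 1) (2-oschina.net, 1). After sorting and grouping, Reduce sees the input (1-oschina.net, [1,1]) (2-baidu.com, [1]) (2-oschina.net, [1]) and sums each list, much like WordCount. The final output is (1-oschina.net, 2) (2-baidu.com, 1) (2-oschina.net, 1), and the file it is saved to should look like this:

UDID-PACK COUNT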

1-oschina.net 2

2-baidu.com 1

2-oschina.net 1

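The second Map/Reduce pass computes PV and UV. With PACK as the key and COUNT as the value, the second Map outputs (oschina.net, 2) (baidu.com, 1) (oschina.net, 1). After sorting and grouping, Reduce receives (oschina.net, [2,1]) (baidu.com, [1]). Summing the values in each list gives the PV, and the length of the list gives the UV, because each element comes from exactly one distinct UDID.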
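The two passes can be traced outside Hadoop with ordinary maps. The sketch below is only a simulation of the design described above: `TwoStagePvUv`, `stageOne`, and `stageTwo` are hypothetical names, and a `TreeMap` stands in for the framework's sort-and-group step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoStagePvUv {

    /** Pass 1: count how often each "UDID-PACK" key occurs (like WordCount). */
    public static Map<String, Integer> stageOne(List<String[]> records) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] r : records) {
            counts.merge(r[0] + "-" + r[1], 1, Integer::sum);
        }
        return counts;
    }

    /** Pass 2: regroup by PACK; sum of counts is PV, number of counts is UV. */
    public static Map<String, int[]> stageTwo(Map<String, Integer> stageOneOutput) {
        Map<String, int[]> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : stageOneOutput.entrySet()) {
            // Everything after the first "-" is the package name.
            String pack = e.getKey().split("-", 2)[1];
            int[] pvUv = result.computeIfAbsent(pack, k -> new int[2]);
            pvUv[0] += e.getValue(); // PV: add this visitor's view count
            pvUv[1] += 1;            // UV: one more distinct UDID
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] { "1", "oschina.net" },
                new String[] { "1", "oschina.net" },
                new String[] { "2", "baidu.com" },
                new String[] { "2", "oschina.net" });
        Map<String, Integer> intermediate = stageOne(records);
        System.out.println(intermediate);
        stageTwo(intermediate).forEach((pack, c) ->
                System.out.println(pack + " PV=" + c[0] + " UV=" + c[1]));
    }
}
```

The UV count is only correct because pass 1 emits each UDID-PACK pair exactly once; feeding raw records straight into pass 2 would make UV equal to PV.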

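This was my first time writing Hadoop code, so I took a clumsy approach: the two Map/Reduce passes live in two separate projects, and the second project uses the output of the first as its input.

http://blog.youkuaiyun.com/chaoping315/article/details/6221440 (this is a solution for an older version)

First Map/Reduce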

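// Imports required at the top of the enclosing source file
// (shared by the Mapper and Reducer classes of both jobs):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class UDID_PACK_Mapper extends
			Mapper<Object, Text, Text, IntWritable> {

		private final static IntWritable one = new IntWritable(1);
		private Text word = new Text();

		@Override
		public void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
			// Fields are comma-separated: the 2nd field is the UDID,
			// the 11th field is the package name.
			String[] details = value.toString().split(",");
			if (details.length < 11) {
				return; // skip malformed lines instead of throwing
			}
			String UDID = details[1];
			String PACK = details[10];

			// Skip records without a usable UDID.
			if (!"null".equals(UDID) && !"".equals(UDID)) {
				word.set(UDID + "-" + PACK);
				context.write(word, one);
			}
		}
	}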

	public static class UDID_PACK_Reducer extends
			Reducer<Text, IntWritable, Text, IntWritable> {
		private IntWritable result = new IntWritable();

		public void reduce(Text key, Iterable<IntWritable> values,
				Context context) throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			result.set(sum);
			context.write(key, result);
		}
	}
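The key-building step of the first mapper can be checked without a cluster. The following is a small sketch under assumptions: `RecordKey` and `toKey` are hypothetical names, the field positions (2nd and 11th) come from the mapper's comment, and trimming the fields is an extra precaution that the original code does not take.

```java
import java.util.Optional;

public class RecordKey {
    static final int UDID_INDEX = 1;   // 2nd comma-separated field
    static final int PACK_INDEX = 10;  // 11th comma-separated field

    /** Builds the "UDID-PACK" key, or returns empty for unusable lines. */
    public static Optional<String> toKey(String line) {
        // Limit -1 keeps trailing empty fields, so the length check is reliable.
        String[] fields = line.split(",", -1);
        if (fields.length <= PACK_INDEX) {
            return Optional.empty(); // too few fields
        }
        String udid = fields[UDID_INDEX].trim();
        String pack = fields[PACK_INDEX].trim();
        if (udid.isEmpty() || "null".equals(udid)) {
            return Optional.empty(); // no usable device ID
        }
        return Optional.of(udid + "-" + pack);
    }

    public static void main(String[] args) {
        // Made-up record with 12 fields; only positions 1 and 10 matter.
        String line = "a,123,c,d,e,f,g,h,i,j,com.example.app,k";
        System.out.println(toKey(line).orElse("(skipped)"));
    }
}
```

Keeping the parsing in a plain static method like this makes it easy to unit-test the filtering rules before running the job on real data.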

Second Map/Reduce

public static class PACK_Mapper extends
			Mapper<Object, Text, Text, IntWritable> {
		
		private IntWritable packCount = new IntWritable();
		private Text word = new Text();

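		@Override
		public void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
			// Input line (output of the first job), for example:
			// 012345678901234-com.TE.wallpaper.ReceiverRestrictedContext	1
			// Split on whitespace first, then split the key on the first "-".
			String[] first = value.toString().split("\\s+");
			if (first.length < 2) {
				return; // skip malformed lines
			}
			// Limit 2 keeps a package name that contains "-" intact
			// (assumes the UDID itself contains no "-").
			String[] second = first[0].split("-", 2);
			if (second.length < 2) {
				return; // key is not in UDID-PACK form
			}

			String PACK = second[1];
			int COUNT = Integer.parseInt(first[1]);

			word.set(PACK);
			packCount.set(COUNT);

			context.write(word, packCount);
		}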
	}

	public static class PACK_Reducer extends
			Reducer<Text, IntWritable, Text, IntWritable> {
		private IntWritable packPV = new IntWritable();
		private IntWritable packUV = new IntWritable();

		public void reduce(Text key, Iterable<IntWritable> values,
				Context context) throws IOException, InterruptedException {
			int sum = 0;
			int length = 0;
			for (IntWritable val : values) {
				sum += val.get();
				length++;
			}
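			packPV.set(sum);     // PV: total number of visits to this package
			packUV.set(length);  // UV: number of distinct UDIDs that visited it
			// Two output lines per package: the first is PV, the second is UV.
			context.write(key, packPV);
			context.write(key, packUV);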
		}
	}
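The two-step split that PACK_Mapper applies to each line of the first job's output can be tried in isolation. This is a sketch, not the author's code: `IntermediateLine` and `parse` are hypothetical names, and the second test line with a hyphenated name is invented to show why the key is split only on the first "-".

```java
public class IntermediateLine {
    public final String pack;
    public final int count;

    private IntermediateLine(String pack, int count) {
        this.pack = pack;
        this.count = count;
    }

    /** Parses "UDID-PACK<whitespace>COUNT" as written by the first job. */
    public static IntermediateLine parse(String line) {
        String[] keyAndCount = line.trim().split("\\s+");
        if (keyAndCount.length != 2) {
            throw new IllegalArgumentException("expected key and count: " + line);
        }
        // Split on the first "-" only, so a "-" inside the name survives.
        String[] udidAndPack = keyAndCount[0].split("-", 2);
        if (udidAndPack.length != 2) {
            throw new IllegalArgumentException("expected UDID-PACK key: " + line);
        }
        return new IntermediateLine(udidAndPack[1], Integer.parseInt(keyAndCount[1]));
    }

    public static void main(String[] args) {
        IntermediateLine a = parse(
                "012345678901234-com.TE.wallpaper.ReceiverRestrictedContext\t1");
        System.out.println(a.pack + " " + a.count);

        IntermediateLine b = parse("1-my-site.example\t4"); // hypothetical hyphenated name
        System.out.println(b.pack + " " + b.count);
    }
}
```

The approach still assumes that the UDID never contains "-"; if it can, a separator that cannot occur in either field (a tab, for instance) would be a safer choice for the composite key.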

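####Note:

1. Run the program with "Run on Hadoop"; launching it through "Open Run Dialog" does not compile the code.

####Open issue####

1. Running two Map/Reduce jobs in one project

I tried creating two Jobs in one program, but the input of the second Job's Map is the output of the first Job, and I don't know how to set that up. Any advice is welcome.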
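One common way to handle this, sketched here under assumptions, is to run the jobs one after the other in a single driver and pass the first job's output directory as the second job's input path. `ChainedJobs`, `buildJob`, `runChain`, and the job names are hypothetical. `Job.getInstance` is the Hadoop 2 and later form; on older releases `new Job(conf, name)` plays the same role. The sketch needs the Hadoop client libraries on the classpath and has not been run against the author's data.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {

    /** Configures one Text -> IntWritable job reading `input`, writing `output`. */
    @SuppressWarnings("rawtypes")
    static Job buildJob(Configuration conf, String name,
            Class<? extends Mapper> mapper, Class<? extends Reducer> reducer,
            Path input, Path output) throws IOException {
        Job job = Job.getInstance(conf, name);
        job.setJarByClass(mapper);
        job.setMapperClass(mapper);
        job.setReducerClass(reducer);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        return job;
    }

    /** Runs job 1, then job 2 on job 1's output; returns true if both succeed. */
    @SuppressWarnings("rawtypes")
    static boolean runChain(Configuration conf,
            Class<? extends Mapper> mapper1, Class<? extends Reducer> reducer1,
            Class<? extends Mapper> mapper2, Class<? extends Reducer> reducer2,
            Path input, Path intermediate, Path output)
            throws IOException, InterruptedException, ClassNotFoundException {
        Job first = buildJob(conf, "udid-pack count",
                mapper1, reducer1, input, intermediate);
        // waitForCompletion blocks, so job 2 starts only after job 1 has finished.
        if (!first.waitForCompletion(true)) {
            return false;
        }
        // The intermediate directory written by job 1 is the input of job 2.
        Job second = buildJob(conf, "pack pv/uv",
                mapper2, reducer2, intermediate, output);
        return second.waitForCompletion(true);
    }
}
```

With the classes from this article, the call would look like `ChainedJobs.runChain(conf, UDID_PACK_Mapper.class, UDID_PACK_Reducer.class, PACK_Mapper.class, PACK_Reducer.class, in, tmp, out)`, where `in`, `tmp`, and `out` are `Path` values of your choosing. Both output directories must not exist before the run, since Hadoop refuses to overwrite an existing output path.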

--------- Keep your head down to pull the cart, but look up to watch the road

Reposted from: https://my.oschina.net/wangjiankui/blog/40747