Experiment 5: MapReduce Programming Practice
I. Objectives
a) Master basic MapReduce programming methods through hands-on practice;
b) Learn to use MapReduce to solve common data processing problems, including data deduplication, data sorting, and data mining.
II. Principles
Map function + Reduce function + Shuffle phase
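To make the Map + Shuffle + Reduce flow concrete, here is a minimal local Java sketch (illustrative only; it does not use Hadoop, and the class name and sample data are made up for this note). The map step emits a (word, 1) pair per token, the shuffle step groups the values by key, and the reduce step sums each group, which mirrors the structure of the WordCount job used later in this experiment.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class LocalMapReduceSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hadoop flink java", "mysql hadoop");
        // Map: emit a (word, 1) pair for every token of every line
        List<String[]> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new String[]{word, "1"});
            }
        }
        // Shuffle: group the emitted values by key
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String[] kv : mapped) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<Integer>()).add(Integer.parseInt(kv[1]));
        }
        // Reduce: sum each group and print (word, total)
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int c : entry.getValue()) {
                sum += c;
            }
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}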
III. Platform
a) Ubuntu 20.04.1
b) Hadoop 3.3.1 (at least pseudo-distributed mode)
c) JDK 1.8.0_301
d) Eclipse 2019-12(R)
IV. Procedure
1. Modify the configuration files so that the Hadoop cluster can run MapReduce jobs
1.1 Modify the configuration file mapred-site.xml
cd /opt/module/hadoop/etc/hadoop
vi mapred-site.xml
Add the following content:
Note: /opt/module/hadoop/ is the Hadoop installation directory used here; change it to match your own installation.
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/opt/module/hadoop/</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/opt/module/hadoop/</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/opt/module/hadoop/</value>
</property>
1.2 Modify the configuration file yarn-site.xml
cd /opt/module/hadoop/etc/hadoop
vi yarn-site.xml
Add the following content (note: yarn.nodemanager.aux-services should also be set to mapreduce_shuffle if it is not configured yet, since that is the shuffle service name MapReduce expects in Hadoop 3):
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
1.3 Modify the configuration file hdfs-site.xml
cd /opt/module/hadoop/etc/hadoop
Edit hdfs-site.xml to disable permission checking, which makes it easier to operate on HDFS through the web UI.
vi hdfs-site.xml
Add the following content (in Hadoop 3.x this property is officially named dfs.permissions.enabled; the old name dfs.permissions still works through the deprecation mapping):
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
2. Restart the Hadoop cluster
Stop the Hadoop cluster:
cd /opt/module/hadoop/sbin
./stop-all.sh
Start the Hadoop cluster:
cd /opt/module/hadoop/sbin
./start-all.sh
Check the cluster processes with jps.
1. Implement WordCount
Create a file wordcount.txt
with the following content:
hadoop flink java
mysql hadoop
hdfs mysql spark
java hadoop
HBASE HBASE
Create the file locally, then upload it with hadoop fs -put (a Java-API alternative is sketched below).
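As an alternative to hadoop fs -put, the file can also be uploaded through the HDFS Java API. The sketch below is only an illustration: the NameNode address matches the one used in the jobs later in this report, while the local path, the user name root, and the class name are assumptions that need to be adapted to your environment.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class UploadWordcount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // connect to the NameNode as user root (assumption: same address as in the jobs below)
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.182.100:9000"), conf, "root");
        // copy the local file (assumed path) into the HDFS home directory of user root
        fs.copyFromLocalFile(new Path("/opt/data/wordcount.txt"), new Path("/user/root/wordcount.txt"));
        fs.close();
    }
}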
1.1 Hadoop ships with a word-count program, written in Java following the MapReduce model and compiled into hadoop-mapreduce-examples-3.3.4.jar. Enter the directory /opt/module/hadoop/share/hadoop/mapreduce:
cd /opt/module/hadoop/share/hadoop/mapreduce
Then run:
hadoop jar hadoop-mapreduce-examples-3.3.4.jar wordcount wordcount.txt result14
Result:
1.2 Programming in IntelliJ IDEA
Preparation:
Configure the HADOOP_HOME environment variable:
HADOOP_HOME
D:\hadoop-3.3.4 (adjust to the actual installation path)
Configure Path:
%HADOOP_HOME%\bin
%HADOOP_HOME%\sbin
Copy the winutils files:
After unzipping winutils-master.zip, open its bin directory,
then copy the files inside it into D:\hadoop-3.3.4\bin.
Copy hadoop.dll into:
C:\Windows\System32
Import the required jar packages (lib2). If the HADOOP_HOME variable is not picked up by the IDE, see the fallback snippet below.
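Note: if IntelliJ IDEA was started before HADOOP_HOME was set, the IDE may not see the environment variable. A commonly used fallback is to set the hadoop.home.dir system property as the first statement of main(), before the Configuration is created; the path is an assumption and must point to the directory containing bin\winutils.exe:
// fallback for Windows: place this at the very top of main() and adjust the path to your own Hadoop directory
System.setProperty("hadoop.home.dir", "D:\\hadoop-3.3.4");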
Create the WordCount class:
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public WordCount() {
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(WordCount.TokenizerMapper.class);
job.setCombinerClass(WordCount.IntSumReducer.class);
job.setReducerClass(WordCount.IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("hdfs://192.168.182.100:9000/user/root/wordcount.txt"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.182.100:9000/user/root/result15"));
System.exit(job.waitForCompletion(true)?0:1);
}
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
private static final IntWritable one = new IntWritable(1);
private Text word = new Text();
public TokenizerMapper() {
}
public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while(itr.hasMoreTokens()) {
this.word.set(itr.nextToken());
context.write(this.word, one);
}
}
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
public IntSumReducer() {
}
public void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
this.result.set(sum);
context.write(key, this.result);
}
}
}
2. Sorting the input files with the Java API
There are several input files, and every line of each file is a single integer. Read the integers from all the files, sort them in ascending order, and write the result to a new file. Each output line contains two integers: the first is the rank of the second integer in the sorted order, and the second is the original integer.
A sample of the input and output files is given below for reference.
Sample input file 1:
Note: the files must not contain blank lines.
33
37
12
40
Sample input file 2:
4
16
39
5
Sample input file 3:
1
45
25
The output file produced from input files 1, 2, and 3 is as follows:
1 1
2 4
3 5
4 12
5 16
6 25
7 33
8 37
9 39
10 40
11 45
(1) Input
Multiple input files; every line of each file is a single integer.
Create a data1 directory on HDFS and upload the input files into data1.
(2) Processing
Create the MergeSort class:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class MergeSort {
// The map function reads each input value, converts it to IntWritable, and emits it as the output key
public static class Map extends Mapper<Object, Text, IntWritable, IntWritable> {
private static IntWritable data = new IntWritable();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String text = value.toString();
data.set(Integer.parseInt(text)); // parsing throws NumberFormatException if a line is not an integer (for example a blank line)
context.write(data, new IntWritable(1));
}
}
// The reduce function copies the input key to the output value; the number of elements in the value list determines how many times the key is written. The counter line_num records the rank of each key.
public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
private static IntWritable line_num = new IntWritable(1);
public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
for (IntWritable val : values) {
context.write(line_num, key);
line_num = new IntWritable(line_num.get() + 1);
}
}
}
// A custom Partitioner: using an assumed upper bound of the input values and the number of partitions, it computes range boundaries and returns the partition ID for each key, so that keys are globally ordered across the reducers
public static class Partition extends Partitioner<IntWritable, IntWritable> {
public int getPartition(IntWritable key, IntWritable value, int num_Partition) {
int Maxnumber = 65223; // assumed upper bound on the input integers (note: this is not the maximum value of int)
int bound = Maxnumber / num_Partition + 1;
int keynumber = key.get();
for (int i = 0; i < num_Partition; i++) {
if (keynumber < bound * (i + 1) && keynumber >= bound * i) {
return i;
}
}
return -1;
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
// if (otherArgs.length != 2) {
// System.err.println("Usage: wordcount <in><out>");
// System.exit(2);
// }
Job job = Job.getInstance(conf, "Merge and sort");
job.setJarByClass(MergeSort.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setPartitionerClass(Partition.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("hdfs://192.168.182.100:9000/user/root/data1"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.182.100:9000/user/root/data1_result"));
System.exit(job.waitForCompletion(true)?0:1);
}
}
(3) Output
hadoop fs -cat data1_result/*
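The same output can also be inspected through the HDFS Java API. Below is a minimal sketch that lists the files in the output directory and prints their contents to the console; the NameNode address and output path are taken from the job above, while the user name root and the class name are assumptions.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
public class CatDataResult {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.182.100:9000"), conf, "root");
        // iterate over the files produced by the job (part-r-00000, _SUCCESS, ...)
        for (FileStatus status : fs.listStatus(new Path("/user/root/data1_result"))) {
            if (!status.isFile()) {
                continue; // skip sub-directories
            }
            FSDataInputStream in = fs.open(status.getPath());
            IOUtils.copyBytes(in, System.out, 4096, false); // print the file content to stdout
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}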
Lab Report (No. 6)
Experiment name: MapReduce Programming Practice 2    Experiment time: Week __, Session __
Group members:    Division of work:
I. Objectives
a) Master basic MapReduce programming methods through hands-on practice;
b) Learn to use MapReduce to solve common data processing problems, including data deduplication, data sorting, and data mining.
II. Equipment and Materials
a) Ubuntu 22.04.3
b) Hadoop 3.1.3
c) HBase 2.2.2
d) JDK 1.8.0
e) Eclipse
III. Principles
Map function + Reduce function + Shuffle phase
IV. Content and Procedure
1. Computing total sales per product with the Java API
Task: compute the total sales quantity of each product.
Data file: sales_data.txt (each line is assumed to contain a product name and a quantity separated by a space, which is the format the mapper below expects)
Result:
Code:
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class fuirt_sale {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "fuirt_sale");
job.setJarByClass(fuirt_sale.class);
job.setMapperClass(fuirt_sale.TokenizerMapper.class);
job.setReducerClass(fuirt_sale.IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("hdfs://192.168.182.100:9000/user/root/sales_data.txt"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.182.100:9000/user/root/result7"));
System.exit(job.waitForCompletion(true)?0:1);
}
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable quantity = new IntWritable();
private Text fruit = new Text();
public TokenizerMapper() {
}
public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
String[] tokens = value.toString().split(" ");
fruit.set(tokens[0]);
quantity.set(Integer.parseInt(tokens[1]));
context.write(fruit, quantity);
}
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
public IntSumReducer() {
}
public void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
this.result.set(sum);
context.write(key, this.result);
}
}
}
2. Matrix-vector multiplication with the Java API
Data file: Matrix.txt (as read by the mapper below, vector entries are lines of the form B,index,value and matrix entries are lines of the form A,row,col,value)
Result:
Code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class Matrix_vector {
public static class MatrixVectorMapper extends Mapper<Object, Text, IntWritable, IntWritable> {
private IntWritable rowKey = new IntWritable();
private IntWritable productValue = new IntWritable();
private int[] vector = new int[10]; // assumes the vector has at most 10 elements and that all B (vector) lines are seen by this mapper before the A (matrix) lines
@Override
protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String[] tokens = value.toString().split(",");
// vector entry
if (tokens[0].equals("B")) {
int index = Integer.parseInt(tokens[1]);
int vectorValue = Integer.parseInt(tokens[2]);
vector[index] = vectorValue; // cache the vector value for the matrix lines that follow
System.out.println("Vector index: " + index + ", Value: " + vectorValue);
}
// matrix entry
else if (tokens[0].equals("A")) {
int row = Integer.parseInt(tokens[1]);
int col = Integer.parseInt(tokens[2]);
int matrixValue = Integer.parseInt(tokens[3]);
// compute the partial product for this matrix entry
int vectorValue = vector[col]; // look up the corresponding vector element
int product = matrixValue * vectorValue;
rowKey.set(row);
productValue.set(product);
// emit (row, partial product)
context.write(rowKey, productValue);
// debug output
System.out.println("Row: " + row + ", Matrix Value: " + matrixValue + ", Vector Value: " + vectorValue + ", Product: " + product);
}
}
}
public static class MatrixVectorReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
@Override
protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Matrix_vector");
job.setJarByClass(Matrix_vector.class);
job.setMapperClass(MatrixVectorMapper.class);
job.setReducerClass(MatrixVectorReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("hdfs://192.168.182.100:9000/user/root/Matrix.txt"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.182.100:9000/user/root/result8"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
3. Computing mutual friends with the Java API
Data file: friends.txt
Data explanation:
A:B,C,D,E,F,M
means that user A has friends B, C, D, E, F, M.
B:C,D,L,E
means that user B has friends C, D, L, E.
The task is to compute, for every pair of users, which friends they have in common.
For example, which friends do A and B have in common?
Which friends do A and C have in common?
Every pair of users has to be enumerated and computed.
Note the symmetry: the combination A and B is the same as B and A, so it only needs to be computed once.
Approach:
1. Map stage
A:B,C,D,E,F,M
1.1 Split on the character ':' into two parts: A and B,C,D,E,F,M.
1.2 Split B,C,D,E,F,M on ',' to obtain the string array [B,C,D,E,F,M] (the code sorts this array so that each pair is generated in a canonical order).
1.3 Iterate over the array and form all two-element combinations: B-C, B-D, B-E, B-F, B-M, C-D, C-E, C-F, C-M, D-E, D-F, D-M, E-F, E-M, F-M.
1.4 Build key-value pairs whose value is the owner of the friend list: (B-C,A), (B-D,A), (B-E,A), (B-F,A), (B-M,A), (C-D,A), (C-E,A), (C-F,A), (C-M,A), (D-E,A), (D-F,A), (D-M,A), (E-F,A), (E-M,A), (F-M,A).
The line B:C,D,L,E is handled in the same way:
1.1 Split on ':' into two parts: B and C,D,L,E.
1.2 Split C,D,L,E on ',' to obtain the string array [C,D,L,E], which is sorted to [C,D,E,L].
1.3 Form all two-element combinations: C-D, C-E, C-L, D-E, D-L, E-L.
1.4 Build key-value pairs: (C-D,B), (C-E,B), (C-L,B), (D-E,B), (D-L,B), (E-L,B).
...................................
All other lines are processed in the same way.
2. Reduce stage
2.1 Pairs with the same key are merged.
(C-D,A) and (C-D,B) have the same key, so they can be merged into (C-D,<A,B>).
This shows that the common friends of users C and D are A and B.
The rest of the data follows the same logic.
Partial screenshot of the results:
Code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.Arrays;
public class Mutual_friend {
public static class Map extends Mapper<LongWritable, Text,Text, Text> {
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// split the line into the user and the friend list
String[] split = value.toString().split(":");
System.out.print(split[0]); // debug output
System.out.print(split[1]);
String[] split1 = split[1].split(",");
Arrays.sort(split1); // sort so that each pair is emitted in a canonical order (e.g., B-C rather than C-B)
for (int i = 0; i < split1.length-1; i++) {
// two nested loops generate every two-element combination of friends
for (int j = i+1; j < split1.length; j++) {
// the value is the owner of this friend list
context.write(new Text(split1[i]+"-"+split1[j]), new Text(split[0]));
}
}
}
}
public static class Reduce extends Reducer<Text, Text,Text,Text> {
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
StringBuffer sb=new StringBuffer();
for (Text te:values) {
sb.append(te).append(",");
}
context.write(key, new Text(sb.toString()));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf,"friends");
job.setJarByClass(Mutual_friend.class);
// set the Mapper class
job.setMapperClass(Map.class);
// set the Reducer class
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path("hdfs://192.168.182.100:9000/user/root/friends.txt"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.182.100:9000/user/root/result32"));
System.exit(job.waitForCompletion(true)?0:1);
}
}
V. Results and Analysis
(Screenshots of the main run results are sufficient.)
VI. Conclusions and Reflections
VII. Instructor's Comments
The experiments above were completed previously and are provided for reference only. The new task is to run MapReduce programs on Hadoop in a Linux environment and complete an offline batch-processing analysis based on MapReduce:
• Use MapReduce to complete data analysis tasks for at least 2 concrete problems.
• HDFS paths must be used as both input and output.
• The problems analyzed should have a certain degree of complexity.
• View the analysis results at least once using an HDFS shell script.
• View the analysis results at least once using the HDFS Java API.
(The program code and screenshots of the corresponding runs must be shown; make the screenshots as detailed as possible.)