# Problem Restatement
According to the provided exam file 《开源大数据平台技术-非笔试类阶段考核试卷B.docx》, this assessment asks students to complete a MapReduce project on a pseudo-distributed Hadoop setup. The specific tasks are:
1. **Deploy a pseudo-distributed Hadoop cluster**, make sure every service starts normally, and provide screenshots as evidence;
2. **Create a Maven project** in IDEA and configure the Hadoop, HDFS, and MapReduce dependencies;
3. **Upload three months of supermarket sales data to HDFS** and verify the upload with shell commands;
4. **Develop two MapReduce programs**:
   - Program 1: total the sales of each product per month (counting only);
   - Program 2: on top of those totals, sort all products per month by sales volume in **descending order**, which requires a custom comparator;
5. **Write a main driver class** that configures the jobs;
6. **Package a jar with Maven**, submit it to the Hadoop node, run it with the `yarn` command, and print the results to the console;
7. **Write a lab report in PDF** containing the procedure, screenshots, an explanation of the principles involved, and personal takeaways.
---
# Recommended Implementation Plan (Complete Walkthrough)
## I. Preliminaries
### 1. Environment
- **Operating system**: Ubuntu 20.04 / CentOS 7 recommended (VM or cloud server)
- **JDK**: JDK 8 (the safe choice: Hadoop 3.x is compiled against Java 8, and Java 11 is supported only as a runtime starting with Hadoop 3.3)
- **Hadoop**: Hadoop 3.3.6 or 3.3.4 (stable releases)
- **Tooling**: IntelliJ IDEA + Maven 3.8+
- **Data file**: prepare `sales_data.csv` in the following format:
```
order_id,product_name,sales_count,month
001,toothpaste,5,1
002,shampoo,3,1
003,toothpaste,4,2
...
```
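If no real dataset is at hand, a tiny generator can produce data in this layout. Below is a minimal sketch (the class name `SalesDataGenerator`, the product list, and the row count are made up for illustration):
```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

// Hypothetical generator for a toy sales_data.csv in the layout shown above.
public class SalesDataGenerator {
    public static void main(String[] args) throws IOException {
        String[] products = {"toothpaste", "shampoo", "soap", "towel"};
        Random rnd = new Random(42); // fixed seed for reproducible data
        try (PrintWriter out = new PrintWriter("sales_data.csv", "UTF-8")) {
            out.println("order_id,product_name,sales_count,month");
            for (int i = 1; i <= 300; i++) {
                String product = products[rnd.nextInt(products.length)];
                int count = 1 + rnd.nextInt(10);  // 1..10 units per order
                int month = 1 + rnd.nextInt(3);   // three months, as the task requires
                out.printf("%03d,%s,%d,%d%n", i, product, count, month);
            }
        }
    }
}
```
Run it once locally; the resulting `sales_data.csv` is what Section IV uploads to HDFS.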
---
## II. Pseudo-Distributed Hadoop Deployment (key to full marks)
### Steps:
1. Edit the core configuration files under `$HADOOP_HOME/etc/hadoop/`:
- `core-site.xml`
```xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
```
- `hdfs-site.xml`
```xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/tmp/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/tmp/dn</value>
</property>
</configuration>
```
- `mapred-site.xml`
```xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
```
- `yarn-site.xml`
```xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
```
2. Format the NameNode (first run only; reformatting destroys existing HDFS metadata):
```bash
hdfs namenode -format
```
3. Start the services:
```bash
start-dfs.sh
start-yarn.sh
```
4. Verify:
```bash
jps
# Expected processes: NameNode, DataNode, ResourceManager, NodeManager, SecondaryNameNode
```
5. Take screenshots of the `jps` output and the status web UIs (e.g. `http://localhost:9870` for HDFS and `http://localhost:8088` for YARN)
---
## III. Create the Maven Project in IDEA
### 1. Create the project
- GroupId: `com.example.bigdata`
- ArtifactId: `SalesAnalysis`
### 2. Add dependencies (`pom.xml`)
Note that `hadoop-client` already brings in the HDFS and MapReduce client libraries and `hadoop-common` transitively, so the two extra entries below are usually unnecessary for this project, though they do no harm.
```xml
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.3.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>3.3.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.3.6</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.4</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.example.bigdata.DriverApp</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
```
---
## IV. Upload the Data to HDFS
```bash
# Create the HDFS directory
hdfs dfs -mkdir -p /input
# Upload the local file
hdfs dfs -put ./sales_data.csv /input/
# Confirm the upload succeeded
hdfs dfs -ls /input/
hdfs dfs -cat /input/sales_data.csv | head -5
```
> ⚠️ Keep screenshots of the output of the commands above!
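Since the Maven project already declares the HDFS dependencies, the upload can optionally also be verified from Java. A minimal sketch using the `FileSystem` API (the class name `HdfsCheck` is made up; the URI assumes the `fs.defaultFS` configured in Section II):
```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: lists /input via the HDFS Java API instead of the shell.
public class HdfsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Adjust the URI if your fs.defaultFS differs from Section II's setting.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf)) {
            for (FileStatus status : fs.listStatus(new Path("/input"))) {
                System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
            }
        }
    }
}
```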
---
## V. MapReduce Development
### 1. Per-month product totals (Job 1)
#### Mapper
```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SalesCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text outKey = new Text();
    private final IntWritable outVal = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.startsWith("order_id")) return; // skip the CSV header
        String[] fields = line.split(",");
        if (fields.length < 4) return;           // skip malformed lines
        String product = fields[1];
        String month = fields[3].trim();
        int salesCount = Integer.parseInt(fields[2].trim());
        // Emit "month_product" -> sales count of this order
        outKey.set(month + "_" + product);
        outVal.set(salesCount);
        context.write(outKey, outVal);
    }
}
```
#### Reducer
```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SalesCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0; // total sales for one "month_product" key
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```
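Optionally: because the aggregation here is a plain sum (associative and commutative), the same reducer class could also be registered as a combiner to cut shuffle traffic. The assignment does not require this, but it would be a one-line addition to the Job 1 setup in the driver:
```java
// Optional: run SalesCountReducer on the map side as a combiner.
// Safe here because summing is associative and commutative.
job1.setCombinerClass(SalesCountReducer.class);
```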
---
### 2. Per-month totals with descending sort (Job 2)
#### New key type (composite key)
```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MonthSalesKey implements WritableComparable<MonthSalesKey> {
    private String month;
    private int sales;

    public MonthSalesKey() {}

    public MonthSalesKey(String month, int sales) {
        this.month = month;
        this.sales = sales;
    }

    public String getMonth() { return month; }
    public void setMonth(String month) { this.month = month; }
    public int getSales() { return sales; }
    public void setSales(int sales) { this.sales = sales; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(month);
        out.writeInt(sales);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        month = in.readUTF();
        sales = in.readInt();
    }

    @Override
    public int compareTo(MonthSalesKey o) {
        // NOTE: months compare lexicographically, which is fine for months 1-3;
        // zero-pad or compare numerically if months 10-12 ever appear.
        int cmp = this.month.compareTo(o.month);
        if (cmp != 0) return cmp;
        return Integer.compare(o.sales, this.sales); // higher sales first (descending)
    }

    @Override
    public String toString() {
        return month + "\t" + sales;
    }
}
```
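A quick local sanity check of the ordering, assuming the class above is on the classpath (the test class is a hypothetical addition, not part of the submitted jobs):
```java
public class MonthSalesKeyTest {
    public static void main(String[] args) {
        MonthSalesKey a = new MonthSalesKey("1", 9);
        MonthSalesKey b = new MonthSalesKey("1", 7);
        MonthSalesKey c = new MonthSalesKey("2", 12);
        System.out.println(a.compareTo(b)); // negative: within month 1, 9 sorts before 7
        System.out.println(a.compareTo(c)); // negative: month 1 sorts before month 2
    }
}
```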
#### Custom comparator (descending)
```java
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescendingComparator extends WritableComparator {
    protected DescendingComparator() {
        super(MonthSalesKey.class, true); // true: create key instances for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        MonthSalesKey k1 = (MonthSalesKey) a;
        MonthSalesKey k2 = (MonthSalesKey) b;
        int cmp = k1.getMonth().compareTo(k2.getMonth()); // months ascending
        if (cmp != 0) return cmp;
        return Integer.compare(k2.getSales(), k1.getSales()); // sales descending
    }
}
```
#### Sort Mapper (uses the sales total as the sort field)
```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, MonthSalesKey, Text> {
    private final MonthSalesKey mapKey = new MonthSalesKey();
    private final Text mapValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input lines are Job 1's output: "month_product<TAB>total" (no header line)
        String line = value.toString();
        if (line.isEmpty()) return;
        String[] parts = line.split("\t");
        if (parts.length < 2) return;
        String monthProduct = parts[0];
        int sales = Integer.parseInt(parts[1].trim());
        String month = monthProduct.split("_", 2)[0];
        mapKey.setMonth(month);
        mapKey.setSales(sales);
        // Carry the full record through as the value for the reducer to re-emit
        mapValue.set(monthProduct + "\t" + sales);
        context.write(mapKey, mapValue);
    }
}
```
#### Sort Reducer (identity-style pass-through)
```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SortReducer extends Reducer<MonthSalesKey, Text, Text, IntWritable> {
    private final Text outputKey = new Text();
    private final IntWritable outputVal = new IntWritable();

    @Override
    protected void reduce(MonthSalesKey key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Values arrive in the order imposed by the sort comparator
        // (month ascending, sales descending); simply re-emit them.
        for (Text val : values) {
            String[] parts = val.toString().split("\t");
            outputKey.set(parts[0]);
            outputVal.set(Integer.parseInt(parts[1]));
            context.write(outputKey, outputVal);
        }
    }
}
```
---
### 3. Main driver class (DriverApp.java)
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: per-month totals
        Job job1 = Job.getInstance(conf, "Monthly Sales Count");
        job1.setJarByClass(DriverApp.class);
        job1.setMapperClass(SalesCountMapper.class);
        job1.setReducerClass(SalesCountReducer.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job1, new Path("/input/sales_data.csv"));
        FileOutputFormat.setOutputPath(job1, new Path("/output_step1"));
        if (!job1.waitForCompletion(true)) {
            System.exit(1); // stop early if the counting job failed
        }

        // Job 2: descending sort within each month
        Job job2 = Job.getInstance(conf, "Sort by Sales Descending");
        job2.setJarByClass(DriverApp.class);
        job2.setMapperClass(SortMapper.class);
        job2.setReducerClass(SortReducer.class);
        // Map output types differ from the final output types, so declare both
        job2.setMapOutputKeyClass(MonthSalesKey.class);
        job2.setMapOutputValueClass(Text.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);
        job2.setSortComparatorClass(DescendingComparator.class);
        job2.setGroupingComparatorClass(MonthGroupingComparator.class);
        job2.setNumReduceTasks(1); // single reducer => globally ordered output
        // Point at the directory; FileInputFormat skips _SUCCESS and hidden files
        FileInputFormat.addInputPath(job2, new Path("/output_step1"));
        FileOutputFormat.setOutputPath(job2, new Path("/output_final"));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```
> 📌 Note: the driver registers `MonthGroupingComparator` so that all records of one month are grouped into a single reduce call; a sketch of this class follows below.
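A minimal sketch of that grouping comparator, mirroring `DescendingComparator` but comparing the month only (it relies on the `getMonth()` accessor defined on `MonthSalesKey`):
```java
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups keys by month only, so one reduce() call receives a whole month's
// records while the sort comparator keeps them ordered by sales, descending.
public class MonthGroupingComparator extends WritableComparator {
    protected MonthGroupingComparator() {
        super(MonthSalesKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((MonthSalesKey) a).getMonth().compareTo(((MonthSalesKey) b).getMonth());
    }
}
```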
---
## VI. Package and Run
### 1. Build with Maven
```bash
mvn clean package
```
This produces `target/SalesAnalysis-1.0-SNAPSHOT.jar`.
### 2. Copy the jar to the Hadoop node and run
```bash
# Remove old output directories (-f suppresses the error if they do not exist yet)
hdfs dfs -rm -r -f /output_step1 /output_final
# Submit the job
yarn jar SalesAnalysis-1.0-SNAPSHOT.jar com.example.bigdata.DriverApp
```
### 3. Inspect the results
```bash
hdfs dfs -cat /output_final/part-r-00000
```
Sample expected output:
```
1_toothpaste	9
1_shampoo	7
2_toothpaste	12
...
```
> ✅ Keep screenshots of: the `yarn` command invocation, the job logs, and the final results
---
## VII. Write the Lab Report (PDF)
### Suggested structure:
1. **Cover page**: course name, student name, student ID, date
2. **Task overview**: briefly state the goals (per-month totals + descending sort)
3. **Deployment screenshots**: `jps` output, web UIs (NameNode / ResourceManager)
4. **Project structure screenshot**: the src/main/java tree in IDEA
5. **HDFS upload screenshots**
6. **Key code screenshots** (Mapper, Reducer, Comparator, Driver)
7. **Run commands and result screenshots**
8. **Technical analysis**:
   - The three MapReduce stages: Map → Shuffle → Reduce
   - The role of the custom key and comparators
   - The Combiner optimization (not used here, but worth mentioning; see the optional one-liner in Section V)
9. **Takeaways** (suggested 300+ characters, written around your debugging experience)
---
## VIII. Pitfall Guide (to help you score high)
| Common error | Fix |
|--------|---------|
| `ClassNotFoundException` | Package with `maven-shade-plugin` and make sure the Main-Class is embedded in the manifest |
| `Connection refused: 9000` | Check that `fs.defaultFS` in `core-site.xml` is `hdfs://localhost:9000` |
| Permission problems | Disable the firewall and set up passwordless SSH to localhost |
| Empty output | Check that the input path exists and that the Mapper skips the header line |
| Sort has no effect | Make sure `setSortComparatorClass()` is registered and the `compareTo()` logic is correct |
---
# Knowledge Points
**1. The MapReduce programming model**
The Map phase processes input records and the Reduce phase aggregates the intermediate results; the model is suited to large-scale offline computation.
**2. HDFS shell commands**
Manage data in the distributed file system with `hdfs dfs -put`, `-ls`, `-cat`, and similar commands.
**3. Custom data types and comparators**
Control sort behavior by implementing the `WritableComparable` interface and extending the `WritableComparator` class.