Project-Based Learning for Big Data Processing: Hands-On Distributed Computing with Hadoop and Spark
Preface: Challenges and Opportunities in the Big Data Era
In today's era of explosive data growth, companies generate terabytes or even petabytes of data every day, and traditional single-machine processing can no longer meet the storage, processing, and analysis demands of such volumes. Have you run into any of the following?
- Data sets so large that a single machine needs hours or even days to process them?
- A need to process streaming data in real time, with no idea where to start?
- Wanting to build a distributed system, but worried about the complexity?
- Interest in big data technologies such as Hadoop and Spark, but no hands-on experience?
Starting from scratch and working through concrete project examples, this article walks through the core ideas behind Hadoop and Spark and the practical skills of distributed computing. By the end, you will be able to:
- ✅ Set up distributed Hadoop and Spark cluster environments
- ✅ Write MapReduce programs that process massive data sets
- ✅ Use Spark for efficient in-memory computation
- ✅ Build real-time stream-processing pipelines
- ✅ Tune the performance of distributed jobs
1. The Big Data Technology Stack at a Glance
1.1 Core Components of the Hadoop Ecosystem
| Component | Description | Typical Use Cases |
|---|---|---|
| HDFS | Distributed file system | Massive data storage; write once, read many |
| MapReduce | Distributed computing framework | Batch processing, ETL jobs |
| YARN | Resource management system | Cluster resource scheduling and management |
| HBase | Distributed NoSQL database | Real-time random reads/writes over massive data sets |
| ZooKeeper | Distributed coordination service | Configuration management, naming service, distributed synchronization |
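To make the table concrete, here is how application code typically talks to HDFS: a minimal sketch using the Hadoop FileSystem API to list the contents of an HDFS directory. The object name HdfsListExample and the /input path are illustrative; the NameNode address matches the cluster configured later in this article, and the Hadoop client libraries are assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsListExample {
  def main(args: Array[String]): Unit = {
    // Point the client at the NameNode (same address as fs.defaultFS in section 2.2.2)
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://hadoop-master:9000")

    val fs = FileSystem.get(conf)
    try {
      // List every entry under /input with its size in bytes
      fs.listStatus(new Path("/input")).foreach { status =>
        println(s"${status.getPath}  ${status.getLen} bytes")
      }
    } finally {
      fs.close()
    }
  }
}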
1.2 Core Components of the Spark Ecosystem
| Component | Description | Performance Advantage |
|---|---|---|
| Spark Core | Base computing engine | In-memory computation; published benchmarks report 10-100x speedups over MapReduce for some workloads |
| Spark SQL | Structured data processing | SQL queries and the DataFrame API |
| Spark Streaming | Stream processing | Micro-batch processing with low latency |
| Spark MLlib | Machine learning library | Distributed machine learning algorithms |
| GraphX | Graph computation library | Graph processing and analysis |
2. Environment Setup and Cluster Deployment
2.1 Hardware Requirements Planning
2.2 Hadoop Cluster Installation and Configuration
2.2.1 Preparing the Base Environment
# Add hostname resolution entries (run as root on every node)
echo "192.168.1.10 hadoop-master" >> /etc/hosts
echo "192.168.1.11 hadoop-worker1" >> /etc/hosts
echo "192.168.1.12 hadoop-worker2" >> /etc/hosts
# Create a dedicated hadoop user (repeat on every node)
useradd hadoop
passwd hadoop
# Configure passwordless SSH login as the hadoop user
su - hadoop
ssh-keygen -t rsa
ssh-copy-id hadoop@hadoop-master
ssh-copy-id hadoop@hadoop-worker1
ssh-copy-id hadoop@hadoop-worker2
2.2.2 Hadoop Configuration Files
core-site.xml:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/tmp</value>
    </property>
</configuration>
hdfs-site.xml:
<configuration>
    <!-- Replication factor; it should not exceed the number of DataNodes in the cluster -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/hdfs/data</value>
    </property>
</configuration>
2.3 Spark Cluster Installation and Configuration
2.3.1 Spark Environment Configuration
spark-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_MASTER_HOST=hadoop-master
export SPARK_WORKER_MEMORY=16g
export SPARK_WORKER_CORES=8
2.3.2 Starting the Cluster Services
# Start HDFS (run on the master node as the hadoop user)
start-dfs.sh
# Start YARN
start-yarn.sh
# Start the Spark standalone master and workers
/opt/spark/sbin/start-master.sh
/opt/spark/sbin/start-workers.sh
# Verify cluster status
hdfs dfsadmin -report
yarn node -list
3. MapReduce in Practice: A Word-Count Case Study
3.1 The MapReduce Programming Model
A MapReduce job runs in two user-defined phases: the map phase turns each input record into intermediate (key, value) pairs, the framework shuffles those pairs so that all values for the same key arrive together, and the reduce phase aggregates each group into the final output. The local sketch below traces that flow for word count.
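This is a plain Scala analogy rather than Hadoop code: it runs the same three phases over an in-memory collection, so the data flow is easy to follow before the full Hadoop implementation in 3.2.

// A local Scala analogy of the MapReduce data flow (not Hadoop code)
val lines = Seq("hello world", "hello hadoop")

// Map phase: turn each line into (word, 1) pairs
val mapped: Seq[(String, Int)] =
  lines.flatMap(_.split(" ").map(word => (word, 1)))

// Shuffle phase: group the pairs by key, as the framework does between map and reduce
val shuffled: Map[String, Seq[Int]] =
  mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

// Reduce phase: fold each group of counts into a single result
val reduced: Map[String, Int] =
  shuffled.map { case (word, counts) => (word, counts.sum) }

println(reduced)  // e.g. Map(hello -> 2, world -> 1, hadoop -> 1); ordering may vary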
3.2 Word Count in Java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures and submits the job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner pre-aggregates map output locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
3.3 Running and Monitoring the MapReduce Job
# Compile and package
javac -cp $(hadoop classpath) WordCount.java
jar cf wc.jar WordCount*.class
# Prepare test data
echo "hello world hello hadoop" > input.txt
hdfs dfs -mkdir -p /input
hdfs dfs -put input.txt /input/
# Submit the job
hadoop jar wc.jar WordCount /input /output
# Monitor job status
yarn application -list
yarn application -status <application_id>
# View the results
hdfs dfs -cat /output/part-r-00000
4. Core Spark Programming in Practice
4.1 The RDD Programming Model
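An RDD (Resilient Distributed Dataset) is an immutable, partitioned collection of records. Transformations such as map and filter are lazy: they only record lineage, and nothing executes until an action such as count or saveAsTextFile triggers a job. A minimal sketch of that behavior, assuming an existing SparkContext named sc:

// Transformations are lazy: they build a lineage graph, not results
val nums    = sc.parallelize(1 to 1000000, numSlices = 8)  // distribute a local range across 8 partitions
val evens   = nums.filter(_ % 2 == 0)                      // recorded, not executed
val squared = evens.map(n => n.toLong * n)                 // still nothing has run

// An action triggers the actual distributed computation
println(squared.count())                                   // 500000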
4.1.1 RDD Transformations in Action
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // Build the Spark configuration
    val conf = new SparkConf()
      .setAppName("Spark Word Count")
      .setMaster("spark://hadoop-master:7077")

    // Create the SparkContext
    val sc = new SparkContext(conf)

    try {
      // Read a text file from HDFS into an RDD
      val textFile = sc.textFile("hdfs://hadoop-master:9000/input/large_file.txt")

      // Transformations: word count
      val wordCounts = textFile
        .flatMap(line => line.split(" "))   // split each line into words
        .map(word => (word, 1))             // map each word to a (word, 1) pair
        .reduceByKey(_ + _)                 // sum the counts per word
        .sortBy(_._2, ascending = false)    // sort by count, descending

      // Action: save the full result to HDFS
      wordCounts.saveAsTextFile("hdfs://hadoop-master:9000/output/spark_wordcount")

      // Collect the top 10 words to the driver and print them
      val top10 = wordCounts.take(10)
      top10.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
4.1.2 Hands-On with the DataFrame API
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object SparkDataFrameExample {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .appName("Spark DataFrame Example")
      .master("spark://hadoop-master:7077")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
      .getOrCreate()

    import spark.implicits._

    try {
      // Build a small example DataFrame
      val data = Seq(
        ("Alice", 25, "Engineering"),
        ("Bob", 30, "Sales"),
        ("Charlie", 35, "Engineering"),
        ("David", 28, "Marketing"),
        ("Eva", 32, "Sales")
      )
      val df = data.toDF("name", "age", "department")

      // Run a SQL query against a temporary view
      df.createOrReplaceTempView("employees")
      val result = spark.sql("""
        SELECT department,
               AVG(age) AS avg_age,
               COUNT(*) AS count
        FROM employees
        GROUP BY department
        ORDER BY avg_age DESC
      """)

      // Show the query result
      result.show()

      // Equivalent DataFrame transformations
      val filtered = df
        .filter($"age" > 25)
        .groupBy("department")
        .agg(
          count("*").as("employee_count"),
          avg("age").as("average_age")
        )
        .orderBy(desc("average_age"))

      filtered.show()
    } finally {
      spark.stop()
    }
  }
}
4.2 Spark Performance Tuning Strategies
4.2.1 Memory Management Tuning
// Tune the Spark configuration
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")                          // executor heap size
  .set("spark.executor.cores", "4")                            // cores per executor
  .set("spark.sql.adaptive.enabled", "true")                   // adaptive query execution
  .set("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .set("spark.sql.adaptive.skewJoin.enabled", "true")
  .set("spark.sql.autoBroadcastJoinThreshold", "10485760")     // 10 MB

// Choose a caching strategy (pick one; cache() is shorthand for persist() with the default storage level)
val df = spark.read.parquet("hdfs://path/to/data")
df.cache()                                       // keep the DataFrame in memory, spilling to disk if needed
// df.persist(StorageLevel.MEMORY_AND_DISK_SER)  // or pick an explicit level (needs org.apache.spark.storage.StorageLevel)

// Broadcast a small table to optimize the join (broadcast() comes from org.apache.spark.sql.functions)
val smallTable = spark.read.parquet("hdfs://path/to/small_table")
val broadcastTable = broadcast(smallTable)
val result = largeTable.join(broadcastTable, Seq("key"))   // largeTable is assumed to be loaded elsewhere
4.2.2 Data Partitioning Tuning
// Repartition by a column to co-locate related rows
val repartitioned = df.repartition(100, $"department")   // hash-partition by department into 100 partitions

// Keep the partition count within a reasonable range
val current = df.rdd.getNumPartitions
val optimalPartitions = current match {
  case n if n > 200 => 200
  case n if n < 10  => 10
  case n            => n
}
// coalesce() can only merge partitions; growing the count requires repartition() (a full shuffle)
val optimized =
  if (optimalPartitions < current) df.coalesce(optimalPartitions)
  else if (optimalPartitions > current) df.repartition(optimalPartitions)
  else df

// Custom partitioner at the RDD level
import org.apache.spark.Partitioner

class DepartmentPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    val dept = key.asInstanceOf[String]
    math.abs(dept.hashCode % numPartitions)
  }
}

val partitioned = df.rdd
  .map(row => (row.getAs[String]("department"), row))
  .partitionBy(new DepartmentPartitioner(10))
  .map(_._2)
5. Real-Time Stream Processing in Practice
5.1 Spark Streaming Architecture
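Spark Streaming uses a micro-batch model: the incoming stream is sliced into small batches (DStreams), each of which is processed with the familiar RDD operations. As a minimal, hedged sketch of that model, the example below counts words arriving on a TCP socket in five-second batches; the object name StreamingWordCount, the host, and the port are placeholders, and in production you would more likely read from a durable source such as Kafka.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Streaming Word Count")

    // Micro-batch interval of 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))

    // Read lines from a TCP socket (host and port are placeholders)
    val lines = ssc.socketTextStream("hadoop-master", 9999)

    // The same word-count logic as before, applied to each micro-batch
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()   // print the first few results of every batch

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // block until the streaming job is stopped
  }
}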