Project-Based Learning for Big Data Processing: Hands-On Distributed Computing with Hadoop and Spark

[Free download] project-based-learning: a curated, project-driven collection of tutorials designed to help developers learn and master new technologies through real project examples. Project address: https://gitcode.com/GitHub_Trending/pr/project-based-learning

Preface: Challenges and Opportunities in the Big Data Era

In today's era of exploding data volumes, enterprises generate terabytes or even petabytes of data every day. Traditional single-machine processing can no longer meet the storage, processing, and analysis demands of data at this scale. Have you ever faced any of the following?

  • Datasets so large that a single machine needs hours or even days to process them?
  • A need to process streaming data in real time, but no idea where to start?
  • A desire to build distributed systems, but concern that they are too complex?
  • Interest in big data technologies such as Hadoop and Spark, but no hands-on experience?

Starting from scratch, this article uses real project examples to build a solid understanding of how Hadoop and Spark work and to develop practical distributed computing skills. By the end, you will be able to:

  • ✅ Set up distributed Hadoop and Spark cluster environments
  • ✅ Write MapReduce programs that process massive datasets
  • ✅ Use Spark for efficient in-memory computation
  • ✅ Build real-time stream processing pipelines
  • ✅ Tune the performance of distributed jobs

1. The Big Data Technology Stack at a Glance

(Diagram: overview of the big data technology stack)

1.1 Core Components of the Hadoop Ecosystem

| Component | Description | Typical Use Cases |
|-----------|-------------|-------------------|
| HDFS | Distributed file system | Massive data storage; write once, read many |
| MapReduce | Distributed computing framework | Batch processing, ETL jobs |
| YARN | Resource management system | Cluster resource scheduling and management |
| HBase | Distributed NoSQL database | Real-time random reads/writes over massive datasets |
| ZooKeeper | Distributed coordination service | Configuration management, naming service, distributed synchronization |

1.2 Core Components of the Spark Ecosystem

| Component | Description | Performance Advantage |
|-----------|-------------|-----------------------|
| Spark Core | Core compute engine | In-memory computation, roughly 10-100x faster than MapReduce |
| Spark SQL | Structured data processing | SQL queries and the DataFrame API |
| Spark Streaming | Stream processing | Micro-batching with low latency |
| Spark MLlib | Machine learning library | Distributed machine learning algorithms |
| GraphX | Graph computation library | Graph processing and analysis |
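
All of these components run on the same execution engine, so one SparkSession exposes the RDD, DataFrame, and SQL APIs side by side. The short sketch below illustrates that; the object name, the `local[*]` master, and the toy data are illustrative only and are not part of the cluster built later in this article.

```scala
import org.apache.spark.sql.SparkSession

object UnifiedStackSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs everything in-process; replace with a real master URL on a cluster
    val spark = SparkSession.builder()
      .appName("unified-stack-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Spark Core: the low-level RDD API
    val rdd = spark.sparkContext.parallelize(Seq("hadoop", "spark", "spark"))
    println(rdd.countByValue())

    // Spark SQL: the same data as a DataFrame and as a SQL view
    val df = rdd.toDF("word")
    df.createOrReplaceTempView("words")
    spark.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word").show()

    spark.stop()
  }
}
```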

2. Environment Setup and Cluster Deployment

2.1 Hardware Planning

(Diagram: hardware planning for master and worker nodes)

2.2 Installing and Configuring the Hadoop Cluster

2.2.1 Preparing the Base Environment

# Set up hostname resolution
echo "192.168.1.10 hadoop-master" >> /etc/hosts
echo "192.168.1.11 hadoop-worker1" >> /etc/hosts
echo "192.168.1.12 hadoop-worker2" >> /etc/hosts

# Create the hadoop user
useradd hadoop
passwd hadoop

# Configure passwordless SSH login
su - hadoop
ssh-keygen -t rsa
ssh-copy-id hadoop@hadoop-master
ssh-copy-id hadoop@hadoop-worker1
ssh-copy-id hadoop@hadoop-worker2

2.2.2 Hadoop Configuration Files

core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/tmp</value>
    </property>
</configuration>

hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/hdfs/data</value>
    </property>
</configuration>
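
A quick way to confirm that these files are picked up is to talk to HDFS through the Hadoop FileSystem API. The following is a minimal sketch, assuming the NameNode address matches `fs.defaultFS` above; the object name and the `/tmp/hdfs-check.txt` path are made up for the example.

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://hadoop-master:9000") // must match core-site.xml

    val fs = FileSystem.get(conf)

    // Write a tiny file, then ask HDFS which replication factor it received
    val path = new Path("/tmp/hdfs-check.txt") // example path
    val out = fs.create(path, true)
    out.write("hello hdfs".getBytes(StandardCharsets.UTF_8))
    out.close()

    println(s"replication = ${fs.getFileStatus(path).getReplication}") // 3 if dfs.replication above is in effect

    fs.close()
  }
}
```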

2.3 Installing and Configuring the Spark Cluster

2.3.1 Spark Environment Configuration

spark-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_MASTER_HOST=hadoop-master
export SPARK_WORKER_MEMORY=16g
export SPARK_WORKER_CORES=8

2.3.2 Starting the Cluster Services

# Start HDFS
start-dfs.sh

# Start YARN
start-yarn.sh

# Start the Spark standalone master and workers
/opt/spark/sbin/start-master.sh
/opt/spark/sbin/start-workers.sh

# Verify cluster status
hdfs dfsadmin -report
yarn node -list
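
Besides the CLI checks above, submitting a trivial Spark job is a quick way to confirm that the standalone master is accepting applications. A minimal smoke test, assuming the master URL configured in spark-env.sh; the object name is arbitrary.

```scala
import org.apache.spark.sql.SparkSession

object ClusterSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cluster-smoke-test")
      .master("spark://hadoop-master:7077") // standalone master from spark-env.sh
      .getOrCreate()

    // Spread a trivial computation across the workers
    val sum = spark.sparkContext.parallelize(1L to 1000000L, numSlices = 16).sum()
    println(s"sum = $sum") // the sum of 1..1,000,000 is 500,000,500,000

    spark.stop()
  }
}
```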

3. MapReduce in Practice: Word Count

3.1 The MapReduce Programming Model

(Diagram: the MapReduce data flow through map, shuffle/sort, and reduce)

3.2 Word Count in Java

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper class: emits a (word, 1) pair for every token
    public static class TokenizerMapper 
        extends Mapper<Object, Text, Text, IntWritable>{
        
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        
        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer class: sums the counts for each word
    public static class IntSumReducer 
        extends Reducer<Text, IntWritable, Text, IntWritable> {
        
        private IntWritable result = new IntWritable();
        
        public void reduce(Text key, Iterable<IntWritable> values, 
            Context context
        ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures and submits the job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

3.3 Running and Monitoring MapReduce Jobs

# Compile and package
javac -cp $(hadoop classpath) WordCount.java
jar cf wc.jar WordCount*.class

# Prepare test data
echo "hello world hello hadoop" > input.txt
hdfs dfs -mkdir -p /input
hdfs dfs -put input.txt /input/

# Submit the job
hadoop jar wc.jar WordCount /input /output

# Monitor job status
yarn application -list
yarn application -status <application_id>

# View the results
hdfs dfs -cat /output/part-r-00000

4. Core Spark Programming in Practice

4.1 The Spark RDD Programming Model

4.1.1 RDD Transformation Example

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // Create the Spark configuration
    val conf = new SparkConf()
      .setAppName("Spark Word Count")
      .setMaster("spark://hadoop-master:7077")
    
    // Create the SparkContext
    val sc = new SparkContext(conf)
    
    try {
      // Read a text file from HDFS into an RDD
      val textFile = sc.textFile("hdfs://hadoop-master:9000/input/large_file.txt")
      
      // Transformations: word count
      val wordCounts = textFile
        .flatMap(line => line.split(" "))       // split each line into words
        .map(word => (word, 1))                 // map each word to a (word, 1) pair
        .reduceByKey(_ + _)                     // aggregate the counts by key
        .sortBy(_._2, false)                    // sort by count, descending
      
      // Action: save the results to HDFS
      wordCounts.saveAsTextFile("hdfs://hadoop-master:9000/output/spark_wordcount")
      
      // Collect the top results to the driver
      val top10 = wordCounts.take(10)
      top10.foreach(println)
      
    } finally {
      sc.stop()
    }
  }
}

4.1.2 Hands-On with the DataFrame API

import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions._

object SparkDataFrameExample {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .appName("Spark DataFrame Example")
      .master("spark://hadoop-master:7077")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
      .getOrCreate()
    
    import spark.implicits._
    
    try {
      // Create a sample DataFrame
      val data = Seq(
        ("Alice", 25, "Engineering"),
        ("Bob", 30, "Sales"), 
        ("Charlie", 35, "Engineering"),
        ("David", 28, "Marketing"),
        ("Eva", 32, "Sales")
      )
      
      val df = data.toDF("name", "age", "department")
      
      // Run a SQL query
      df.createOrReplaceTempView("employees")
      
      val result = spark.sql("""
        SELECT department, 
               AVG(age) as avg_age,
               COUNT(*) as count
        FROM employees 
        GROUP BY department
        ORDER BY avg_age DESC
      """)
      
      // Show the results
      result.show()
      
      // Equivalent DataFrame transformations
      val filtered = df
        .filter($"age" > 25)
        .groupBy("department")
        .agg(
          count("*").as("employee_count"),
          avg("age").as("average_age")
        )
        .orderBy(desc("average_age"))
      
      filtered.show()
      
    } finally {
      spark.stop()
    }
  }
}

4.2 Spark Performance Tuning Strategies

4.2.1 Memory Management Tuning

// Tuned Spark configuration
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")           // executor heap size
  .set("spark.executor.cores", "4")             // cores per executor
  .set("spark.sql.adaptive.enabled", "true")    // adaptive query execution
  .set("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .set("spark.sql.adaptive.skewJoin.enabled", "true")
  .set("spark.sql.autoBroadcastJoinThreshold", "10485760") // 10 MB

// Choosing a caching strategy
val df = spark.read.parquet("hdfs://path/to/data")
df.cache()        // cache with the default storage level (memory, spilling to disk)
df.persist()      // persist, optionally with an explicit StorageLevel

// Broadcast the small side of a join
import org.apache.spark.sql.functions.broadcast
val largeTable = spark.read.parquet("hdfs://path/to/large_table")
val smallTable = spark.read.parquet("hdfs://path/to/small_table")
val broadcastTable = broadcast(smallTable)

val result = largeTable.join(broadcastTable, Seq("key"))

4.2.2 Data Partition Tuning

// Repartition by column
val repartitioned = df.repartition(100, $"department")  // 100 partitions, hashed by department

// Clamp the partition count to a sensible range (10 to 200 here)
val optimalPartitions = df.rdd.getNumPartitions match {
  case n if n > 200 => 200
  case n if n < 10 => 10
  case n => n
}

val optimized = df.coalesce(optimalPartitions)

// Custom partitioner
import org.apache.spark.Partitioner
class DepartmentPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    val dept = key.asInstanceOf[String]
    math.abs(dept.hashCode % numPartitions)
  }
}

val partitioned = df.rdd
  .map(row => (row.getAs[String]("department"), row))
  .partitionBy(new DepartmentPartitioner(10))
  .map(_._2)
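
Whichever scheme you choose, it is worth checking that the resulting partitions are reasonably balanced. The small check below is not from the original snippet; it simply counts the rows that land in each partition of the `partitioned` RDD defined above.

```scala
// Count rows per partition to spot data skew
val partitionSizes = partitioned
  .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
  .collect()

partitionSizes.sortBy(_._1).foreach { case (idx, n) =>
  println(s"partition $idx: $n rows")
}
```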

5. Real-Time Stream Processing in Practice

5.1 Spark Streaming Architecture
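
Spark Streaming splits a live input stream into micro-batches and runs the familiar RDD operations on each batch. Below is a minimal sketch of the classic DStream word count, assuming a plain-text TCP source on hadoop-master:9999 (started, for example, with `nc -lk 9999`) and a 5-second batch interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Streaming Word Count")
      .setMaster("spark://hadoop-master:7077")

    // Micro-batch interval of 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))

    // Read lines from a TCP text source; the receiver occupies one core,
    // so the application needs at least two cores in total
    val lines = ssc.socketTextStream("hadoop-master", 9999)

    // The same word-count logic as the batch jobs, applied to every micro-batch
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

For new applications, Structured Streaming on the DataFrame API is the recommended successor to DStreams, but the underlying micro-batch model is the same.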


Disclosure: parts of this article were produced with AI assistance (AIGC) and are provided for reference only.
