Spark-Core (MapPartitions / Memory Optimization)
1.map
map applies a function (an operator) to every element of the RDD; put simply, each element goes through an f(x) transformation.
Return a new RDD by applying a function to all elements of this RDD.
2.mapPartitions
RDD : n Partitions : N Records
An RDD is made up of multiple partitions, and each partition in turn holds many records.
map applies its function to each individual Record, while mapPartitions applies it once per Partition: the function receives an iterator over all records in that partition.
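A minimal sketch contrasting the two; the data, object name and app name are illustrative only:
import org.apache.spark.{SparkConf, SparkContext}

object MapVsMapPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MapVsMapPartitions").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10, numSlices = 2)
    // map: the function runs once per record
    val doubled = rdd.map(x => x * 2)
    // mapPartitions: the function runs once per partition and receives an Iterator
    // over its records, so any per-partition setup cost is paid only once
    val doubledByPartition = rdd.mapPartitions(iter => iter.map(x => x * 2))
    println(doubled.collect().mkString(","))
    println(doubledByPartition.collect().mkString(","))
    sc.stop()
  }
}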
3.foreach / foreachPartition
The same distinction holds for the actions: foreach runs its function once per record, foreachPartition once per partition, as in the sketch below.
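A sketch of the typical foreachPartition pattern; openConnection/save/close are hypothetical placeholders for whatever sink is being written to:
rdd.foreachPartition { iter =>
  // hypothetical: open one connection per partition instead of one per record
  val conn = openConnection()
  iter.foreach(record => conn.save(record))
  conn.close()
}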
4.textFile
The default number of partitions is defaultMinPartitions = min(defaultParallelism, 2), i.e. 2 in most setups.
In Hadoop MapReduce, the Mapper/Reducer classes take 4 type parameters, while the map/reduce methods take 3 parameters.
map(pair => pair._2.toString) keeps only the value and discards the byte-offset key.
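Roughly what sc.textFile(path) does internally, rewritten here as a user-level call (sc and path are assumed to exist; details vary slightly across Spark versions):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Read (offset, line) pairs with TextInputFormat, then keep only the line text
val lines = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
  .map(pair => pair._2.toString)   // pair._1 is the byte offset, pair._2 is the line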
5.spark-shell
#Linux kernel name
uname
#show the kernel release
uname -r
#show all system information
uname -a
#-z tests whether a string has zero length
if [ -z "aaa" ]; then echo "empty"; fi
#directory that contains the current script
home=`cd $(dirname "$0"); pwd`
echo ${home}
6.Serialization
1.Java serialization
slow
large
2.Kryo serialization
requires registering classes in advance
fast
compact
//Switch the serializer to Kryo
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//Then register the classes that will be serialized
sparkConf.registerKryoClasses(Array(classOf[Info]))
class Info  // the custom class being registered
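Put together, a minimal end-to-end sketch (the Info fields, app name and RDD contents are made up for illustration):
import org.apache.spark.{SparkConf, SparkContext}

case class Info(id: Int, name: String)

object KryoExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("KryoExample")
      .setMaster("local[2]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    sparkConf.registerKryoClasses(Array(classOf[Info]))
    val sc = new SparkContext(sparkConf)
    val infos = sc.parallelize(Seq(Info(1, "a"), Info(2, "b"), Info(3, "a")))
    // Shuffle data (and any serialized cache) now goes through Kryo
    val counts = infos.map(i => (i.name, 1)).reduceByKey(_ + _)
    println(counts.collect().mkString(","))
    sc.stop()
  }
}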
7.Level of Parallelism
reduceByKey and groupByKey accept a numPartitions argument that sets the number of shuffle partitions and thus the parallelism; the usual guideline is 2-3 tasks per CPU core so that all cores stay busy (see the sketch below).
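For example, assuming words is an RDD[String]; the count of 100 is only an illustration and should be sized to roughly 2-3 x the total number of cores:
// Explicitly set the number of reduce-side partitions
val wordCounts = words.map(w => (w, 1)).reduceByKey(_ + _, 100)
// spark.default.parallelism sets the default when no count is given
// sparkConf.set("spark.default.parallelism", "100")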
8.Memory Management
1.execution
computation in shuffles, joins, sorts and aggregations
2.storage
caching
Execution and storage share a unified region, so execution can borrow from storage and vice versa.
Execution may evict storage if necessary: when execution needs memory it can evict cached blocks.
Storage may not evict execution: cached data can never force running tasks to release memory.
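In unified memory management (covered below) this behaviour is controlled by two settings; the values shown are the defaults:
import org.apache.spark.SparkConf

// spark.memory.fraction: share of (heap - 300 MB) used for execution + storage combined
// spark.memory.storageFraction: part of that region protected from eviction by execution
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")
  .set("spark.memory.storageFraction", "0.5")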
Static memory management (the legacy mode, used before Spark 1.6)
StaticMemoryManager.getMaxExecutionMemory(conf) returns the maximum execution memory
StaticMemoryManager.getMaxStorageMemory(conf) returns the maximum storage memory
private def getMaxExecutionMemory(conf: SparkConf): Long = {
  val systemMaxMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
  // Reject JVM heaps smaller than the minimum system memory, MIN_MEMORY_BYTES (32 MB)
  if (systemMaxMemory < MIN_MEMORY_BYTES) {
    throw new IllegalArgumentException(s"System memory $systemMaxMemory must " +
      s"be at least $MIN_MEMORY_BYTES. Please increase heap size using the --driver-memory " +
      s"option or spark.driver.memory in Spark configuration.")
  }
  if (conf.contains("spark.executor.memory")) {
    val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
    if (executorMemory < MIN_MEMORY_BYTES) {
      throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +
        s"$MIN_MEMORY_BYTES. Please increase executor memory using the " +
        s"--executor-memory option or spark.executor.memory in Spark configuration.")
    }
  }
  // Fraction of the heap used for shuffle (execution)
  val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.2)
  // Safety fraction
  val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
  // e.g. 10G * 0.2 * 0.8 = 1.6G
  (systemMaxMemory * memoryFraction * safetyFraction).toLong
}
private def getMaxStorageMemory(conf: SparkConf): Long = {
  val systemMaxMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
  val memoryFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
  val safetyFraction = conf.getDouble("spark.storage.safetyFraction", 0.9)
  // e.g. 10G * 0.6 * 0.9 = 5.4G
  (systemMaxMemory * memoryFraction * safetyFraction).toLong
}
Unified memory management (the default from Spark 1.6 onward)
private def getMaxMemory(conf: SparkConf): Long = {
  val systemMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
  val reservedMemory = conf.getLong("spark.testing.reservedMemory",
    if (conf.contains("spark.testing")) 0 else RESERVED_SYSTEM_MEMORY_BYTES)
  // Same system-memory check as above; here the minimum is reservedMemory * 1.5 = 450 MB
  // (RESERVED_SYSTEM_MEMORY_BYTES is 300 MB)
  val minSystemMemory = (reservedMemory * 1.5).ceil.toLong
  if (systemMemory < minSystemMemory) {
    throw new IllegalArgumentException(s"System memory $systemMemory must " +
      s"be at least $minSystemMemory. Please increase heap size using the --driver-memory " +
      s"option or spark.driver.memory in Spark configuration.")
  }
  // SPARK-12759 Check executor memory to fail fast if memory is insufficient
  if (conf.contains("spark.executor.memory")) {
    val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
    if (executorMemory < minSystemMemory) {
      throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +
        s"$minSystemMemory. Please increase executor memory using the " +
        s"--executor-memory option or spark.executor.memory in Spark configuration.")
    }
  }
  // e.g. 10G - 300M
  val usableMemory = systemMemory - reservedMemory
  val memoryFraction = conf.getDouble("spark.memory.fraction", 0.6)
  // e.g. (10G - 300M) * 0.6 = the memory shared by execution and storage
  (usableMemory * memoryFraction).toLong
}
def apply(conf: SparkConf, numCores: Int): UnifiedMemoryManager = {
  val maxMemory = getMaxMemory(conf)
  new UnifiedMemoryManager(
    conf,
    maxHeapMemory = maxMemory,
    onHeapStorageRegionSize =
      // e.g. (10G - 300M) * 0.6 * 0.5 = the storage region
      (maxMemory * conf.getDouble("spark.memory.storageFraction", 0.5)).toLong,
    numCores = numCores)
}
The sections above show how memory is apportioned under the two memory-management schemes.
9.GC (Garbage Collection)
http://spark.apache.org/docs/latest/tuning.html#memory-tuning
minor GC (collects the young generation; frequent and relatively cheap)
full GC (collects the whole heap; expensive, to be minimized)
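To see how often minor and full GCs happen, the tuning guide suggests enabling GC logging in the executor JVMs; a sketch using JDK 8 flags (newer JVMs use -Xlog:gc* instead):
// Add GC logging flags to every executor JVM
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")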