Spark-Core (MapPartitions / Memory Optimization)

This article takes a close look at memory management in Apache Spark: the differences between static memory management and unified memory management, how execution memory and storage memory interact, and which configuration parameters to tune for better application performance.



1.map

map applies a function (an operator) to every element of the RDD; put simply, each element is transformed by f(x).

Return a new RDD by applying a function to all elements of this RDD.

2.mapPartitions

1 RDD : n partitions : N records per partition

An RDD is made up of multiple partitions, and each partition is made up of multiple records.

map's function is applied to each record, while mapPartitions' function is applied to each partition, receiving that partition's records as an iterator (see the sketch below).
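A minimal sketch of the difference (a standalone Scala app; the numbers and partition count are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object MapVsMapPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10, numSlices = 2)

    //map: the function runs once per record
    val doubled = rdd.map(x => x * 2)

    //mapPartitions: the function runs once per partition and
    //receives all of that partition's records as an Iterator
    val sumPerPartition = rdd.mapPartitions(iter => Iterator.single(iter.sum))

    println(doubled.collect().mkString(","))          //2,4,...,20
    println(sumPerPartition.collect().mkString(","))  //15,40
    sc.stop()
  }
}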

3.foreach / foreachPartition

The same distinction holds for the actions: foreach runs its function once per record, while foreachPartition runs it once per partition, which is the natural place to set up a per-partition resource such as a database connection (see the sketch below).
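A minimal sketch (spark-shell style, where sc is predefined; Connection is a hypothetical stand-in for something expensive like a JDBC connection):

//hypothetical per-partition resource, standing in for e.g. a JDBC connection
class Connection {
  def write(record: String): Unit = println(record)
  def close(): Unit = ()
}

val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

rdd.foreachPartition { iter =>
  val conn = new Connection()            //one setup per partition, not per record
  try iter.foreach(r => conn.write(r))
  finally conn.close()
}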

4.textFile

The default minimum number of partitions is 2 (strictly, defaultMinPartitions = math.min(defaultParallelism, 2)).

In Hadoop MapReduce, the Mapper/Reducer classes take 4 generic type parameters, while their map/reduce methods take 3 parameters.

map(pair => pair._2.toString) keeps only the value (the line of text) and drops the key (the byte offset).
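This is, in essence, how SparkContext.textFile is written: it calls hadoopFile with Hadoop's TextInputFormat, which produces (byte offset, line) pairs, then maps the offset away. A simplified sketch of the source (details such as withScope are omitted):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] =
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
             minPartitions)
    .map(pair => pair._2.toString)   //keep the line of text, drop the byte offset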

5.spark-shell

#kernel name
uname
#show the kernel version
uname -r
#show all system information
uname -a

#-z tests whether a string's length is zero
if [ -z "aaa" ]; then echo "empty"; else echo "not empty"; fi

#directory containing the current script
home=$(cd "$(dirname "$0")"; pwd)
echo ${home}
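These two idioms come together in Spark's own launch scripts; a simplified sketch of the pattern (the exact lines vary between Spark versions):

#!/usr/bin/env bash
#if SPARK_HOME is unset or empty (-z), derive it from this script's location
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
fi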

6.Serialization

1.Java serialization: slow, and produces large serialized output.

2.Kryo serialization: fast and compact, but you must register the classes you serialize.

//switch the serializer to Kryo,
//then register the classes that will be serialized with it
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.registerKryoClasses(Array(classOf[Info]))
class Info
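One related switch worth knowing: by default Kryo falls back to writing the full class name for unregistered classes, which wastes space; setting spark.kryo.registrationRequired makes it fail fast instead:

//throw an error instead of silently serializing unregistered classes
sparkConf.set("spark.kryo.registrationRequired", "true")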

7.Level of Parallelism

reduceByKey and groupByKey accept a partition count that raises the parallelism of the shuffle; a common rule of thumb is 2-3 tasks per CPU core so the cores stay fully utilized (see the example below).
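For example, the optional second argument of reduceByKey sets the number of partitions for the shuffled result, and spark.default.parallelism sets the default when no argument is given (pairs below is an assumed RDD[(String, Int)]):

//explicit partition count for this shuffle: 100 result partitions
val counts = pairs.reduceByKey(_ + _, 100)

//or set the default for all shuffles, e.g. when submitting:
//  spark-submit --conf spark.default.parallelism=100 ...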

8.Memory Management

1.execution: computation in shuffles, joins, sorts and aggregations

2.storage: caching

Execution and storage can borrow unused memory from each other, but eviction is one-way:

Execution may evict storage if necessary: when execution needs the space, it can evict cached blocks from storage.

Storage may not evict execution: storage can never reclaim memory that execution is using.

Static memory management (used through Spark 1.5)

StaticMemoryManager.getMaxExecutionMemory(conf) returns the maximum execution memory

StaticMemoryManager.getMaxStorageMemory(conf) returns the maximum storage memory

private def getMaxExecutionMemory(conf: SparkConf): Long = {
    val systemMaxMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
		//reject heaps smaller than the minimum system memory (MIN_MEMORY_BYTES, 32 MB)
    if (systemMaxMemory < MIN_MEMORY_BYTES) {
      throw new IllegalArgumentException(s"System memory $systemMaxMemory must " +
        s"be at least $MIN_MEMORY_BYTES. Please increase heap size using the --driver-memory " +
        s"option or spark.driver.memory in Spark configuration.")
    }
    if (conf.contains("spark.executor.memory")) {
      val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
      if (executorMemory < MIN_MEMORY_BYTES) {
        throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +
          s"$MIN_MEMORY_BYTES. Please increase executor memory using the " +
          s"--executor-memory option or spark.executor.memory in Spark configuration.")
      }
    }
  	//fraction of the heap that shuffle (execution) may use
    val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.2)
    //safety factor, leaving headroom to reduce OOM risk
    val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
    // 10G * 0.2 * 0.8 = 1.6G
    (systemMaxMemory * memoryFraction * safetyFraction).toLong
  }


private def getMaxStorageMemory(conf: SparkConf): Long = {
    val systemMaxMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
    val memoryFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
    val safetyFraction = conf.getDouble("spark.storage.safetyFraction", 0.9)
    // 10G * 0.6 * 0.9 = 5.4G
    (systemMaxMemory * memoryFraction * safetyFraction).toLong
  }

Unified memory management (Spark 1.6 and later)

 private def getMaxMemory(conf: SparkConf): Long = {
    val systemMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
    val reservedMemory = conf.getLong("spark.testing.reservedMemory",
      if (conf.contains("spark.testing")) 0 else RESERVED_SYSTEM_MEMORY_BYTES)
    //same minimum check as in the static manager; here the floor is 1.5 * reservedMemory = 450 MB
    val minSystemMemory = (reservedMemory * 1.5).ceil.toLong
    if (systemMemory < minSystemMemory) {
      throw new IllegalArgumentException(s"System memory $systemMemory must " +
        s"be at least $minSystemMemory. Please increase heap size using the --driver-memory " +
        s"option or spark.driver.memory in Spark configuration.")
    }
    // SPARK-12759 Check executor memory to fail fast if memory is insufficient
    if (conf.contains("spark.executor.memory")) {
      val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
      if (executorMemory < minSystemMemory) {
        throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +
          s"$minSystemMemory. Please increase executor memory using the " +
          s"--executor-memory option or spark.executor.memory in Spark configuration.")
      }
    }
   
   //10G - 300M 
    val usableMemory = systemMemory - reservedMemory
    val memoryFraction = conf.getDouble("spark.memory.fraction", 0.6)
   //(10G - 300M) * 0.6: the region shared by execution and storage
    (usableMemory * memoryFraction).toLong
  }

  def apply(conf: SparkConf, numCores: Int): UnifiedMemoryManager = {
    val maxMemory = getMaxMemory(conf)
    new UnifiedMemoryManager(
      conf,
      maxHeapMemory = maxMemory,
      onHeapStorageRegionSize =
      //(10G - 300M) * 0.6 * 0.5: the slice initially set aside for storage
        (maxMemory * conf.getDouble("spark.memory.storageFraction", 0.5)).toLong,
      numCores = numCores)
  }

Memory-share breakdown under the two management schemes (the original figure is not reproduced here).
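Under the unified manager both boundaries are configurable; an illustrative sketch (the values shown are the defaults, not tuning advice):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  //share of (heap - 300MB reserved) given to execution + storage combined
  .set("spark.memory.fraction", "0.6")
  //share of that region initially set aside for storage (cached blocks)
  .set("spark.memory.storageFraction", "0.5")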

9.GC (Garbage Collection)

http://spark.apache.org/docs/latest/tuning.html#memory-tuning

minor GC: collects only the young generation; frequent but cheap

full GC: collects the whole heap; expensive, and worth tuning to avoid
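The tuning guide linked above suggests first measuring how often GC happens by enabling GC logging in the executor JVMs; a sketch using the flags the guide mentions (pre-Java-9 logging flags):

import org.apache.spark.SparkConf

//print a line in the executor logs for every JVM garbage collection
val conf = new SparkConf().set("spark.executor.extraJavaOptions",
  "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")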
