Spark-Core (MapPartitions / Memory Optimization)
1.map
map applies a function (an operator) to every element of the RDD; put simply, each element goes through an f(x) transformation.
Return a new RDD by applying a function to all elements of this RDD.
2.mapPartitions
RDD : n Partitions : N Records
An RDD is made up of multiple partitions, and each partition in turn holds many records.
map applies its function to each individual Record, while mapPartitions applies it once per Partition: the function receives an iterator over all records in that partition.
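A minimal sketch contrasting the two; the data, object name and app name are illustrative only:
import org.apache.spark.{SparkConf, SparkContext}

object MapVsMapPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MapVsMapPartitions").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10, numSlices = 2)
    // map: the function runs once per record
    val doubled = rdd.map(x => x * 2)
    // mapPartitions: the function runs once per partition and receives an Iterator
    // over its records, so any per-partition setup cost is paid only once
    val doubledByPartition = rdd.mapPartitions(iter => iter.map(x => x * 2))
    println(doubled.collect().mkString(","))
    println(doubledByPartition.collect().mkString(","))
    sc.stop()
  }
}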
3.foreach / foreachPartition
The same distinction holds for the actions: foreach runs its function once per record, foreachPartition once per partition, as in the sketch below.
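A sketch of the typical foreachPartition pattern; openConnection/save/close are hypothetical placeholders for whatever sink is being written to:
rdd.foreachPartition { iter =>
  // hypothetical: open one connection per partition instead of one per record
  val conn = openConnection()
  iter.foreach(record => conn.save(record))
  conn.close()
}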
4.textFile
The default number of partitions is defaultMinPartitions = min(defaultParallelism, 2), i.e. 2 in most setups.
In Hadoop MapReduce, the Mapper/Reducer classes take 4 type parameters, while the map/reduce methods take 3 parameters.
map(pair => pair._2.toString) keeps only the value and discards the byte-offset key.
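Roughly what sc.textFile(path) does internally, rewritten here as a user-level call (sc and path are assumed to exist; details vary slightly across Spark versions):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Read (offset, line) pairs with TextInputFormat, then keep only the line text
val lines = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
  .map(pair => pair._2.toString)   // pair._1 is the byte offset, pair._2 is the line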
5.spark-shell
#Linux kernel name
uname
#show the kernel release
uname -r
#show all system information
uname -a
#-z tests whether a string has zero length
if [ -z "aaa" ]; then echo "empty"; fi
#directory that contains the current script
home=`cd $(dirname "$0"); pwd`
echo ${home}
6.Serialization
1.Java serialization
slow
large
2.Kryo serialization
requires registering classes in advance
fast
compact
//Switch the serializer to Kryo
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//Then register the classes that will be serialized
sparkConf.registerKryoClasses(Array(classOf[Info]))
class Info  // the custom class being registered
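Put together, a minimal end-to-end sketch (the Info fields, app name and RDD contents are made up for illustration):
import org.apache.spark.{SparkConf, SparkContext}

case class Info(id: Int, name: String)

object KryoExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("KryoExample")
      .setMaster("local[2]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    sparkConf.registerKryoClasses(Array(classOf[Info]))
    val sc = new SparkContext(sparkConf)
    val infos = sc.parallelize(Seq(Info(1, "a"), Info(2, "b"), Info(3, "a")))
    // Shuffle data (and any serialized cache) now goes through Kryo
    val counts = infos.map(i => (i.name, 1)).reduceByKey(_ + _)
    println(counts.collect().mkString(","))
    sc.stop()
  }
}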
7.Level of Parallelism
reduceByKey and groupByKey accept a numPartitions argument that sets the number of shuffle partitions and thus the parallelism; the usual guideline is 2-3 tasks per CPU core so that all cores stay busy (see the sketch below).
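For example, assuming words is an RDD[String]; the count of 100 is only an illustration and should be sized to roughly 2-3 x the total number of cores:
// Explicitly set the number of reduce-side partitions
val wordCounts = words.map(w => (w, 1)).reduceByKey(_ + _, 100)
// spark.default.parallelism sets the default when no count is given
// sparkConf.set("spark.default.parallelism", "100")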
8.Memory Management
1.execution
computation in shuffles, joins, sorts and aggregations
2.storage
caching
Execution and storage share a unified region, so execution can borrow from storage and vice versa.
Execution may evict storage if necessary: when execution needs memory it can evict cached blocks.
Storage may not evict execution: cached data can never force running tasks to release memory.
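In unified memory management (covered below) this behaviour is controlled by two settings; the values shown are the defaults:
import org.apache.spark.SparkConf

// spark.memory.fraction: share of (heap - 300 MB) used for execution + storage combined
// spark.memory.storageFraction: part of that region protected from eviction by execution
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")
  .set("spark.memory.storageFraction", "0.5")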
Static memory management (the legacy mode, used before Spark 1.6)
StaticMemoryManager.getMaxExecutionMemory(conf) returns the maximum execution memory
StaticMemoryManager.getMaxStorageMemory(conf) returns the maximum storage memory
private def getMaxExecutionMemory(conf: SparkConf): Long = {
  val systemMaxMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
  // Reject JVM heaps smaller than the minimum system memory, MIN_MEMORY_BYTES (32 MB)
  if (systemMaxMemory < MIN_MEMORY_BYTES) {
    throw new IllegalArgumentException(s"System memory $systemMaxMemory must " +
      s"be at least $MIN_MEMORY_BYTES. Please increase heap size using the --driver-memory " +
      s"option or spark.driver.memory in Spark configuration.")
  }
  if (conf.contains("spark.executor.memory")) {
    val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
    if (executorMemory < MIN_MEMORY_BYTES) {
      throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +
        s"$MIN_MEMORY_BYTES. Please increase executor memory using the " +
        s"--executor-memory option or spark.executor.memory in Spark configuration.")
    }
  }
  // Fraction of the heap used for shuffle (execution)
  val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.2)
  // Safety fraction
  val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
  // e.g. 10G * 0.2 * 0.8 = 1.6G
  (systemMaxMemory * memoryFraction * safetyFraction).toLong
}
private def getMaxStorageMemory(conf: SparkConf): Long = {
  val systemMaxMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
  val memoryFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
  val safetyFraction = conf.getDouble("spark.storage.safetyFraction", 0.9)
  // e.g. 10G * 0.6 * 0.9 = 5.4G
  (systemMaxMemory * memoryFraction * safetyFraction).toLong
}
Unified memory management (the default from Spark 1.6 onward)
private def getMaxMemory(conf: SparkConf): Long = {
  val systemMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
  val reservedMemory = conf.getLong("spark.testing.reservedMemory",
    if (conf.contains("spark.testing")) 0 else RESERVED_SYSTEM_MEMORY_BYTES)
  // Same system-memory check as above; here the minimum is reservedMemory * 1.5 = 450 MB
  // (RESERVED_SYSTEM_MEMORY_BYTES is 300 MB)
  val minSystemMemory = (reservedMemory * 1.5).ceil.toLong
  if (systemMemory < minSystemMemory) {
    throw new IllegalArgumentException(s"System memory $systemMemory must " +
      s"be at least $minSystemMemory. Please increase heap size using the --driver-memory " +
      s"option or spark.driver.memory in Spark configuration.")
  }
  // SPARK-12759 Check executor memory to fail fast if memory is insufficient
  if (conf.contains("spark.executor.memory")) {
    val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
    if (executorMemory < minSystemMemory) {
      throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +
        s"$minSystemMemory. Please increase executor memory using the " +
        s"--executor-memory option or spark.executor.memory in Spark configuration.")
    }
  }
  // e.g. 10G - 300M
  val usableMemory = systemMemory - reservedMemory
  val memoryFraction = conf.getDouble("spark.memory.fraction", 0.6)
  // e.g. (10G - 300M) * 0.6 = the memory shared by execution and storage
  (usableMemory * memoryFraction).toLong
}
def apply(conf: SparkConf, numCores: Int): UnifiedMemoryManager = {
  val maxMemory = getMaxMemory(conf)
  new UnifiedMemoryManager(
    conf,
    maxHeapMemory = maxMemory,
    onHeapStorageRegionSize =
      // e.g. (10G - 300M) * 0.6 * 0.5 = the storage region
      (maxMemory * conf.getDouble("spark.memory.storageFraction", 0.5)).toLong,
    numCores = numCores)
}
The sections above show how memory is apportioned under the two memory-management schemes.
9.GC (Garbage Collection)
http://spark.apache.org/docs/latest/tuning.html#memory-tuning
minor GC (collects the young generation; frequent and relatively cheap)
full GC (collects the whole heap; expensive, to be minimized)
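To see how often minor and full GCs happen, the tuning guide suggests enabling GC logging in the executor JVMs; a sketch using JDK 8 flags (newer JVMs use -Xlog:gc* instead):
// Add GC logging flags to every executor JVM
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")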