One of the tuning points listed on the official Spark site is data serialization.
1. Data serialization
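1) Java serialization
2) Kryo serialization (fast and compact)
Register your classes when using Kryo; without registration the effect is the opposite.
There are three ways to enable Kryo: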
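1) In code, add conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
2) Configure it in spark-defaults.conf
3) Pass it at submit time in the form --conf key=value. This takes precedence over spark-defaults.conf, but a value set on SparkConf in code takes precedence over both.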
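The three ways above can be sketched as follows. This is a minimal illustration, not the original author's code: the object name KryoConfSketch, the helper kryoConf, and the jar name app.jar are made up for the example, and the second and third ways appear only as comments because they live outside Scala code.

```scala
import org.apache.spark.SparkConf

object KryoConfSketch {
  // Way 1: set the serializer in code.
  // loadDefaults = false keeps the sketch independent of any spark.* system
  // properties on the machine running it.
  def kryoConf(): SparkConf =
    new SparkConf(false)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  // Way 2: the equivalent line in conf/spark-defaults.conf
  //   spark.serializer  org.apache.spark.serializer.KryoSerializer
  //
  // Way 3: the equivalent flag at submit time (app.jar is a hypothetical name)
  //   spark-submit --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  //     --class SerializeCompare app.jar

  def main(args: Array[String]): Unit =
    println(kryoConf().get("spark.serializer"))
}
```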
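Once the serializer is set, register the classes to be serialized: conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
Register every class you have; MyClass1 and MyClass2 here stand for the classes to register.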
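A small sketch of the registration step, assuming two placeholder case classes in place of MyClass1 and MyClass2. It also sets spark.kryo.registrationRequired, a standard Spark setting that is not part of the original notes; it is shown here only as one possible way to catch a forgotten registration early.

```scala
import org.apache.spark.SparkConf

// Hypothetical classes standing in for MyClass1 and MyClass2.
case class MyClass1(id: Int)
case class MyClass2(name: String)

object KryoRegistrationSketch {
  def registeredConf(): SparkConf =
    new SparkConf(false)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Optional: make Kryo throw on an unregistered class instead of
      // quietly writing the class name alongside each object.
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

  def main(args: Array[String]): Unit =
    // registerKryoClasses records the class names under this key.
    println(registeredConf().get("spark.kryo.classesToRegister"))
}
```

With registrationRequired enabled, a job that serializes a class missing from the list fails with an error naming that class, which may be easier to notice than a larger cache size.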
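The following cases were tested:
1. No serialization: 34.3 MB
2. Java serialization: 25.1 MB
3. Kryo serialization without registering the class: 40.2 MB
4. Kryo serialization with the class registered: 21.1 MB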
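In short, Kryo serialization needs the corresponding classes registered. Without registration it gives the worst result in this test, larger even than no serialization, because Kryo then has to store the full class name with every object.
Apart from that case, serialized storage is smaller overall than unserialized storage.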
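To see the effect of registration on a single object, the sketch below serializes one record with Spark's JavaSerializer and KryoSerializer directly, without starting a SparkContext. It is an illustration under assumptions: SampleInfo, SerializedSizeSketch, and sizeOf are hypothetical names, and per-object byte counts are not the same measurement as the cached RDD sizes listed above, so the numbers it prints should not be expected to match them.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{JavaSerializer, KryoSerializer, Serializer}

// Hypothetical record with the same shape as the Info class in the test code.
case class SampleInfo(name: String, age: Int, gender: String, addr: String)

object SerializedSizeSketch {
  private val sample = SampleInfo("lsw", 30, "male", "beijing")

  // Number of bytes the sample object occupies after serialization.
  def sizeOf(serializer: Serializer): Int =
    serializer.newInstance().serialize(sample).remaining()

  def javaSize: Int =
    sizeOf(new JavaSerializer(new SparkConf(false)))

  // Kryo with no registration: the class name is written with the object.
  def kryoUnregisteredSize: Int =
    sizeOf(new KryoSerializer(new SparkConf(false)))

  // Kryo with the class registered: a small numeric id is written instead.
  def kryoRegisteredSize: Int =
    sizeOf(new KryoSerializer(
      new SparkConf(false).registerKryoClasses(Array(classOf[SampleInfo]))))

  def main(args: Array[String]): Unit = {
    println(s"java: $javaSize bytes")
    println(s"kryo, unregistered: $kryoUnregisteredSize bytes")
    println(s"kryo, registered: $kryoRegisteredSize bytes")
  }
}
```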
The test code is below, taken from https://blog.youkuaiyun.com/lsshlsw/article/details/50856842
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

case class Info(name: String, age: Int, gender: String, addr: String)

object SerializeCompare {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KyroTest")
    // Switch to Kryo and register the class that will be cached
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.registerKryoClasses(Array(classOf[Info]))
    val sc = new SparkContext(conf)

    // Build one million random records
    val arr = new ArrayBuffer[Info]()
    val nameArr = Array[String]("lsw", "yyy", "lss")
    val genderArr = Array[String]("male", "female")
    val addressArr = Array[String]("beijing", "shanghai", "shengzhen", "wenzhou", "hangzhou")
    for (_ <- 1 to 1000000) {
      val name = nameArr(Random.nextInt(3))
      val age = Random.nextInt(100)
      val gender = genderArr(Random.nextInt(2))
      val address = addressArr(Random.nextInt(5))
      arr += Info(name, age, gender, address)
    }

    // toSeq keeps this compiling on Scala 2.13, where Seq means immutable.Seq
    val rdd = sc.parallelize(arr.toSeq)
    // Store the RDD in memory in serialized form
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    // count() forces evaluation so the RDD is actually cached
    rdd.count()
  }
}