- sequenceFile
  `sc.sequenceFile[BytesWritable, String]("hdfs://hadoop000:8020/user/hive/warehouse/")`; `BytesWritable` is the recommended key type.
- Serialization (plays a major role in performance)
  In Hadoop, anything that implements `Writable` can be serialized. Spark supports two serialization mechanisms:
- Java serialization (slow, relatively low performance)
  Produces a larger serialized format.
- Kryo serialization (fast and compact)
  Up to ~10x faster than Java serialization, but it does not support all serializable types and requires registering the classes you will serialize. Prefer it when there is frequent network communication: `conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")`.
  Classes must be registered before use:

  ```scala
  val conf = new SparkConf()
    .setMaster(...)
    .setAppName(...)
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
  val sc = new SparkContext(conf)
  ```
- Java serialization vs. Kryo serialization: a comparison test
  From the Spark tuning guide: "Finally, if you don't register your custom classes, Kryo will still work, but it will have to store the full class name with each object, which is wasteful."

  ```scala
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel
  import scala.collection.mutable.ListBuffer

  case class Student(id: Int, name: String, age: Int)

  object SerializationApp {
    def main(args: Array[String]): Unit = {
      val sparkConf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("SerializationApp")
        // Without the two settings below, Java serialization is used by default.
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(Array(classOf[Student]))
      val sc = new SparkContext(sparkConf)

      val students = ListBuffer[Student]()
      for (i <- 0 to 1000000) {
        students.append(Student(i, "kaola_" + i, 25))
      }
      val studentRdd = sc.parallelize(students)

      // Measured sizes of the cached RDD:
      //   Java serialization:        32 MB
      //   Kryo without registration: 52.8 MB
      //   Kryo with registration:    23 MB
      studentRdd.persist(StorageLevel.MEMORY_ONLY_SER)
      studentRdd.count()

      Thread.sleep(1000 * 20)
      sc.stop()
    }
  }
  ```
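The size effect above comes from Spark's serialized storage, but it can be reproduced off-Spark with plain Java serialization. A minimal sketch (no Spark dependency; `Student` and the helper object are local stand-ins, not Spark API):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Local stand-in for the Student case class used in the Spark test above.
case class Student(id: Int, name: String, age: Int)

object JavaSerializationSize {
  // Serialize a value with plain Java serialization and return the byte count.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray.length
  }

  def main(args: Array[String]): Unit = {
    val one = serializedSize(Student(1, "kaola_1", 25))
    val many = serializedSize((1 to 1000).map(i => Student(i, "kaola_" + i, 25)).toList)
    // Java serialization writes the full class descriptor into the stream,
    // which is the per-object overhead Kryo registration avoids.
    println(s"one student: $one bytes, 1000 students: $many bytes")
  }
}
```

The descriptor is written once per stream here, so batching amortizes it; Spark, however, serializes many small records, which is why the choice of serializer shows up so clearly in the cached-RDD sizes.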
- Where serialization is used in Spark:
  - external variables referenced inside operators (closures)
  - shuffle
  - cache
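The first scenario is just closure serialization: Spark serializes the task closure before shipping it to executors, so anything the operator's function captures must itself be serializable. A minimal off-Spark sketch of that rule (class and object names are illustrative):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// A helper that is NOT Serializable, like a connection pool or client
// object accidentally referenced inside an RDD operator.
class Multiplier(val factor: Int)

object ClosureSerializationDemo {
  // Returns true if the object survives Java serialization.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      true
    } catch { case _: NotSerializableException => false }

  def main(args: Array[String]): Unit = {
    val m = new Multiplier(3)
    // This function captures `m`, so serializing the closure drags `m`
    // along, which is exactly what Spark does before shipping a task.
    val f: Int => Int = x => x * m.factor
    println(canSerialize(f)) // false: Multiplier is not Serializable

    val factor = m.factor // capture only the primitive value instead
    val g: Int => Int = x => x * factor
    println(canSerialize(g)) // true
  }
}
```

Extracting the needed field into a local `val` before the closure, as `g` does, is the usual fix for `Task not serializable` errors in Spark jobs.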
- spark-core05: Serialization