When an RDD is created, it is partitioned into multiple pieces for parallel processing. If we have to join the RDD with other RDDs many times on some key, it is better to partition the RDDs by the join key, so that all the join operations become purely local operations.
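As a quick sketch of that idea (assuming a SparkContext named sc, as in the spark-shell; the two small pair RDDs below are made-up sample data), partitioning both sides of a join with the same HashPartitioner lets the join match keys partition by partition, without reshuffling either RDD:

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(8)

// Two pair RDDs keyed by the same join key (a user id here), co-partitioned and cached.
val orders = sc.parallelize(Seq(("u1", 10.0), ("u2", 25.5), ("u1", 3.0))).partitionBy(part).persist()
val users  = sc.parallelize(Seq(("u1", "Alice"), ("u2", "Bob"))).partitionBy(part).persist()

// Both sides already share the same partitioner, so this join needs no extra shuffle.
val joined = orders.join(users)   // RDD[(String, (Double, String))]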
The following example shows how to partition an RDD by a key and also store it back with Gzip encoding.
import org.apache.spark.HashPartitioner

val part = new HashPartitioner(8)
val m = data.map(line => (line.take(10), line)).partitionBy(part).values
m.saveAsTextFile(outFile, classOf[org.apache.hadoop.io.compress.GzipCodec])
Here we assume "data" is an RDD[String] and the key of each record is the first 10 characters of the line. Eight partitions are created, and the output is written back with Gzip compression.
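If the compressed output needs to be read again, sc.textFile goes through Hadoop's input formats, which pick the Gzip codec from the .gz file extension, so the part files are decompressed transparently. A minimal sketch, reusing the same outFile path:

// Read the Gzip-compressed text back; decompression happens automatically.
val restored = sc.textFile(outFile)
restored.take(5).foreach(println)

One thing to keep in mind is that Gzip files are not splittable, so each compressed part file is read as a single task; spreading the data over eight partitions keeps the individual files from becoming too large.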
In short, this post shows how to make dataset processing in Spark more efficient and reduce storage usage through proper partitioning and Gzip encoding: create a partitioner, map the data into key-value pairs, apply the partitioning strategy, and save the files with Gzip compression.