Partition by Hash on Keys


When an RDD is created, it is partitioned into multiple pieces for parallel processing. If we have to join the RDD with other RDDs many times on some key, we had better partition those RDDs by the join key, so that all the join operations become purely local operations.
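As a minimal sketch of why this helps (assuming a spark-shell SparkContext named sc and two illustrative pair RDDs, orders and profiles, that are not part of the original post), pre-partitioning both sides with the same HashPartitioner lets the subsequent join reuse the co-located partitions instead of reshuffling either input:

import org.apache.spark.HashPartitioner

// Hypothetical pair RDDs keyed by the same field (names are illustrative only).
val orders   = sc.parallelize(Seq(("u1", 10.0), ("u2", 25.0), ("u1", 5.0)))
val profiles = sc.parallelize(Seq(("u1", "Alice"), ("u2", "Bob")))

val part = new HashPartitioner(8)

// Partition both RDDs with the same partitioner and cache them, so that
// repeated joins on the key do not trigger a shuffle each time.
val ordersByUser   = orders.partitionBy(part).persist()
val profilesByUser = profiles.partitionBy(part).persist()

// Because both sides share the partitioner, this join runs locally per partition.
val joined = ordersByUser.join(profilesByUser)
joined.collect().foreach(println)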

The following example shows how to do so, and also how to store the RDD back with Gzip encoding.

import org.apache.spark.HashPartitioner

// Hash-partition by the record key (here, the first 10 characters of each line).
val part = new HashPartitioner(8)
val m = data.map(line => (line.take(10), line)).partitionBy(part).values
// Write the result back as text files compressed with the Gzip codec.
m.saveAsTextFile(outFile, classOf[org.apache.hadoop.io.compress.GzipCodec])

Here we assume "data" is an RDD[String] and the key of each record is its first 10 characters. Eight partitions are created and written back with compression.
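As a follow-up sketch (assuming the same outFile path and the spark-shell sc), the compressed output can be read straight back, since Spark's text input handles the .gz part files transparently; keep in mind that Gzip files are not splittable:

// Spark decompresses the Gzip part files transparently when reading.
val restored = sc.textFile(outFile)
println(restored.count())

// Because Gzip is not splittable, "restored" gets one partition per part
// file; repartition if more parallelism is needed downstream.
val rebalanced = restored.repartition(8)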
