When an RDD is created, it is partitioned into multiple pieces for parallel processing. If we have to join the RDD with other RDDs many times on some key, it is better to partition the RDDs by the join key, so that all the join operations can be purely local operations.
The following example shows how to do so, and also how to store the RDD back with Gzip compression.
import org.apache.spark.HashPartitioner

val part = new HashPartitioner(8)
val m = data.map(line => (line.take(10), line)).partitionBy(part).values
m.saveAsTextFile(outFile, classOf[org.apache.hadoop.io.compress.GzipCodec])
Here we assume “data” is an RDD[String] and the key of each record is the first 10 characters of the line. Eight partitions are created, and the output is written back with Gzip compression.
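To illustrate the earlier point about joins on co-partitioned RDDs, here is a minimal sketch (the pair RDDs and their contents are made up for illustration; “sc” is an existing SparkContext). When both sides are partitioned by the same HashPartitioner, matching keys already live in the same partition, so the join does not need another shuffle.

import org.apache.spark.HashPartitioner

// Hypothetical pair RDDs keyed by an integer id; data is illustrative only.
val part = new HashPartitioner(8)

val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  .partitionBy(part)
  .persist()                       // keep the partitioned layout for repeated joins

val orders = sc.parallelize(Seq((1, 9.99), (2, 4.50), (1, 2.25)))
  .partitionBy(part)

// Both sides share the same partitioner, so the join is computed locally
// within each partition instead of re-shuffling either RDD.
val joined = users.join(orders)    // RDD[(Int, (String, Double))]

Persisting the partitioned RDD is what makes this pay off when it is joined many times: without persist(), the partitioning work would be repeated on each use.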