Spark's DoubleRDDFunctions provides a histogram function for RDD[Double]. However, there is no histogram function for RDD[String]. Here is a quick exercise that builds one. We will use an immutable Map in this exercise.
Create a dummy RDD[String] and apply the aggregate method to calculate the histogram:
scala> val d = sc.parallelize((1 to 10).map(_ % 3).map("val" + _.toString))
scala> d.aggregate(Map[String,Int]())(
     |   (m,c) => m.updated(c, m.getOrElse(c, 0) + 1),
     |   (m,n) => (m /: n){ case (map,(k,v)) => map.updated(k, v + map.getOrElse(k, 0)) }
     | )
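To make the roles of the two functions concrete, here is a minimal plain-Scala sketch of the two-phase logic, with no Spark involved. The names `AggregateSketch`, `seqOp`, `combOp`, and `chunkSize` are illustrative assumptions; `chunkSize` only stands in for however Spark happens to split the data into partitions. It uses `foldLeft`, which is the spelled-out form of the `/:` operator.

```scala
// Plain-Scala sketch of the two-phase logic behind aggregate; no Spark needed.
object AggregateSketch {
  type Hist = Map[String, Int]

  // seqOp: fold one element into a partition-local histogram.
  def seqOp(m: Hist, c: String): Hist =
    m.updated(c, m.getOrElse(c, 0) + 1)

  // combOp: merge two partition-local histograms by summing counts per key.
  def combOp(m: Hist, n: Hist): Hist =
    n.foldLeft(m) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0)) }

  // chunkSize is a hypothetical stand-in for Spark's partitioning (must be > 0).
  def histogram(data: Seq[String], chunkSize: Int): Hist = {
    val partials = data.grouped(chunkSize).map(_.foldLeft(Map.empty[String, Int])(seqOp))
    partials.foldLeft(Map.empty[String, Int])(combOp)
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 10).map(_ % 3).map("val" + _.toString)
    println(histogram(data, 4))
    // The result should not depend on how the data is chunked.
    println(histogram(data, 3) == histogram(data, 10))
  }
}
```

For the dummy data above, the counts are 3 for val0, 4 for val1, and 3 for val2, regardless of the chunk size.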
The second function passed to aggregate merges two maps. We can define it as a standalone Scala function:
scala> def mapadd[T](m: Map[T,Int], n: Map[T,Int]) = {
     |   (m /: n){ case (map,(k,v)) => map.updated(k, v + map.getOrElse(k, 0)) }
     | }
It combines the histograms computed on different partitions:
scala> mapadd(Map("a" -> 1, "b" -> 2), Map("a" -> 2, "c" -> 1))
res3: scala.collection.mutable.Map[String,Int] = Map(b -> 2, a -> 3, c -> 1)
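Because partial results can be merged in any grouping, the merge function should be associative, and an empty map should act as its neutral element. The sketch below checks these properties on a few sample maps. It is an illustration rather than a proof; the object name `MapAddSketch` and the sample values are assumptions, and the function is written with `foldLeft` instead of `/:`, which newer Scala versions deprecate.

```scala
object MapAddSketch {
  // Same logic as mapadd above, written with foldLeft and an explicit return type.
  def mapadd[T](m: Map[T, Int], n: Map[T, Int]): Map[T, Int] =
    n.foldLeft(m) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0)) }

  def main(args: Array[String]): Unit = {
    val a = Map("a" -> 1, "b" -> 2)
    val b = Map("a" -> 2, "c" -> 1)
    val c = Map("b" -> 5)

    // Merging with an empty map changes nothing.
    assert(mapadd(a, Map.empty[String, Int]) == a)
    // The order of the two arguments does not matter for the counts.
    assert(mapadd(a, b) == mapadd(b, a))
    // Grouping does not matter either.
    assert(mapadd(mapadd(a, b), c) == mapadd(a, mapadd(b, c)))

    println(mapadd(a, b))
  }
}
```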
Using mapadd, we can rewrite the aggregate step:
scala> d.aggregate(Map[String,Int]())(
     |   (m,c) => m.updated(c, m.getOrElse(c, 0) + 1),
     |   mapadd(_, _)
     | )
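Outside the REPL, the same steps could be wrapped in a small reusable helper. The following is a sketch under stated assumptions: Spark is on the classpath, the job runs with a local master, and the names `RddHistogram` and `histogram` are hypothetical helpers, not part of Spark's API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddHistogram {
  // Merge two histograms by summing the counts for each key.
  def mapadd[T](m: Map[T, Int], n: Map[T, Int]): Map[T, Int] =
    n.foldLeft(m) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0)) }

  // Hypothetical helper: count the occurrences of each distinct element of an RDD.
  def histogram[T](rdd: RDD[T]): Map[T, Int] =
    rdd.aggregate(Map.empty[T, Int])(
      (m, c) => m.updated(c, m.getOrElse(c, 0) + 1), // within a partition
      mapadd(_, _)                                   // across partitions
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads, for experimentation only.
    val conf = new SparkConf().setAppName("rdd-histogram").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val d = sc.parallelize((1 to 10).map(_ % 3).map("val" + _.toString))
      println(histogram(d))
    } finally {
      sc.stop()
    }
  }
}
```

The whole histogram is collected on the driver as a single Map, so this approach is best suited to RDDs with a modest number of distinct values.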