How to deal with error SPARK-5063 in Spark



I get the SPARK-5063 error message on the line with println:

d.foreach{ x =>
  for (i <- 0 until x.length)
    println(m.lookup(x(i)))
}

d is RDD[Array[String]] and m is RDD[(String, String)]. Is there any way to print the values the way I want? Or how can I convert d from RDD[Array[String]] to Array[String]?

I think you can achieve it by broadcasting m –  shekhar  Apr 23 at 7:34
 
@shekhar can you tell me how to broadcast m? I am quite new to Spark and Scala. Thank you. –  G_cy  Apr 23 at 7:48
 

1 Answer


SPARK-5063 relates to better error messages when trying to nest RDD operations, which is not supported.

It's a usability issue, not a functional one. The root cause is the nesting of RDD operations and the solution is to break that up.
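
The most direct way to break it up, as a minimal sketch assuming d is small enough to collect to the driver, is to pull d out of the cluster first, so that m.lookup is issued from the driver rather than from inside another RDD's closure:

// d.collect() yields Array[Array[String]] on the driver;
// flatten turns it into the Array[String] asked about in the question.
val local: Array[String] = d.collect().flatten
// Each lookup is now issued from the driver, so no RDD operations are nested.
local.foreach(key => println(m.lookup(key)))

Each lookup launches a separate job, though, so this only makes sense for small data; the join and broadcast approaches below scale better.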

Here we are effectively trying a join of dRDD and mRDD. If mRDD is large, an rdd.join would be the recommended way; otherwise, if mRDD is small, i.e. fits in the memory of each executor, we can collect it, broadcast it, and do a 'map-side' join.

JOIN

A simple join would go like this:

// rdd plays the role of d: an RDD of string arrays
val rdd = sc.parallelize(Seq(Array("one","two","three"), Array("four", "five", "six")))
// map plays the role of m: the lookup table as a pair RDD
val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6))
// flatten the arrays into individual elements and key each element by itself
val flat = rdd.flatMap(_.toSeq).keyBy(x => x)
// join on the key and drop it, keeping the (element, value) pairs
val res = flat.join(map).map{ case (k, v) => v }
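
As a quick sanity check (a hypothetical spark-shell run; the ordering of an RDD's elements is not guaranteed), collecting res should print the matched pairs:

// bring the joined pairs back to the driver and print them,
// e.g. (one,1), (two,2), ... in some arbitrary order
res.collect().foreach(println)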

If we would like to use broadcast, we first need to collect the value of the resolution table locally in order to broadcast it to all executors. NOTE: the RDD to be broadcast MUST fit in the memory of the driver as well as of each executor.

Map-side JOIN with Broadcast variable

val rdd = sc.parallelize(Seq(Array("one","two","three"), Array("four", "five", "six")))
val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6))
// collect the small lookup RDD as a Map on the driver and broadcast it to every executor
val bcTable = sc.broadcast(map.collectAsMap)
// each element is resolved against the broadcast table - no nested RDD operations
val res2 = rdd.flatMap{ arr => arr.map(elem => (elem, bcTable.value(elem))) }
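
One caveat: bcTable.value(elem) throws java.util.NoSuchElementException for any element missing from the table (the key not found error that comes up in the comments below). A minimal defensive variant, using getOrElse with an arbitrary -1 placeholder for unknown keys:

// getOrElse avoids the exception for keys absent from the broadcast map;
// the -1 default is just an illustrative placeholder
val res2safe = rdd.flatMap{ arr => arr.map(elem => (elem, bcTable.value.getOrElse(elem, -1))) }
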
Both strategies give error messages. The first gives not found: value keyBy and missing arguments for method identity in object Predef; follow this method with '_' if you want to treat it as a partially applied function. The second gives a type mismatch: elem should be String but it is Array[String] now. –  G_cy  Apr 23 at 18:44
 
@G_cy Right - a few issues from typing directly on this interface. Here you go... –  maasg  Apr 23 at 19:56
 
Here is the problem: the first, join-based one works, but the second does not. It gives the error java.util.NoSuchElementException: key not found: null when I try to do foreach(println). But I can get the result with the first method, so it should be able to find the keys. –  G_cy  Apr 23 at 21:04
 
I tested it on the spark shell - what's the issue? At least, do you get the idea? –  maasg  Apr 23 at 21:07
 
I get the idea, but I am still a little confused by the broadcast concept. –  G_cy  Apr 23 at 21:30
