closure:
The following code is wrong:
var counter = 0
var rdd = sc.parallelize(data)
// Wrong: Don't do this!!
rdd.foreach(x => counter += x)
println("Counter value: " + counter)
To execute jobs, Spark breaks up the processing of RDD operations into
tasks, each of which is executed by an executor. Prior to execution, Spark
computes the task’s closure. The closure is those variables and methods which
must be visible for the executor to perform its computations on the RDD (in this
case foreach()). This closure is serialized and sent to each executor.
The variables within the closure sent to each executor are now copies and thus,
when counter is referenced within the foreach function, it’s no longer the
counter on the driver node. There is still a counter in the memory of the driver
node but this is no longer visible to the executors! The executors only see the
copy from the serialized closure. Thus, the final value of counter will still be zero
since all operations on counter were referencing the value within the serialized
closure.
The logic in the code above can be implemented correctly with an Accumulator, as sketched below.
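A minimal Scala sketch of that fix, assuming sc is an existing SparkContext and data is the same local collection used in the broken example (Spark 2.x longAccumulator API):
// Register a named long accumulator with the driver's SparkContext
val counter = sc.longAccumulator("counter")
val rdd = sc.parallelize(data)
// Each task adds to its own copy; Spark merges the updates back into
// the driver-side accumulator as tasks finish
rdd.foreach(x => counter.add(x))
// Reading value on the driver now reflects the merged task updates
println("Counter value: " + counter.value)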
Accumulator:
Warning: When a Spark task finishes, Spark will try to merge the accumulated
updates in this task to an accumulator. If it fails, Spark will ignore the failure
and still mark the task successful and continue to run other tasks. Hence, a
buggy accumulator will not impact a Spark job, but it may not be updated
correctly even though the Spark job succeeds.
// Usage (jsc is an existing JavaSparkContext):
LongAccumulator accum = jsc.sc().longAccumulator();
jsc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> accum.add(x));
// ...
// 10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
accum.value();
// returns 10
A broadcast variable is also a shared variable, but unlike an accumulator it is read-only: executors can read its value but must not update it.
// Usage (sc is an existing JavaSparkContext):
Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});
broadcastVar.value();
// returns [1, 2, 3]
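To show how a broadcast value is typically consumed inside tasks, here is a hypothetical Scala sketch (the lookup table countryNames and the RDD codes are invented for illustration; sc is an existing SparkContext):
// Ship a small read-only lookup table to each executor once,
// instead of serializing it into every task's closure
val countryNames = sc.broadcast(Map("CN" -> "China", "US" -> "United States"))
val codes = sc.parallelize(Seq("CN", "US", "CN"))
// Tasks read the shared data via .value; they must not modify it
val named = codes.map(code => countryNames.value.getOrElse(code, "Unknown"))
println(named.collect().mkString(", "))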
This article has looked at how closures work in Apache Spark and the pitfalls they create, explained why directly modifying a driver-side variable inside an RDD operation produces incorrect results, and shown how to use an Accumulator to perform the accumulation correctly, contrasting it with the role of Broadcast variables.