spark-broadcast

This article introduces broadcast variables in Apache Spark and how they are implemented. Broadcast variables let the programmer cache a read-only dataset on each node instead of shipping a copy of it with every task. The article explains how to create and use broadcast variables, and compares the two implementations: the HTTP-based broadcast mechanism and the BitTorrent-like TorrentBroadcast.


@(spark)[broadcast]
Spark's broadcast variables are used to broadcast immutable datasets to all nodes.

Broadcast

/**                                                                                                                                                                     
 * A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable                                                                          
 * cached on each machine rather than shipping a copy of it with tasks. They can be used, for                                                                           
 * example, to give every node a copy of a large input dataset in an efficient manner. Spark also                                                                       
 * attempts to distribute broadcast variables using efficient broadcast algorithms to reduce                                                                            
 * communication cost.                                                                                                                                                  
 *                                                                                                                                                                      
 * Broadcast variables are created from a variable `v` by calling                                                                                                       
 * [[org.apache.spark.SparkContext#broadcast]].                                                                                                                         
 * The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the                                                                         
 * `value` method. The interpreter session below shows this:                                                                                                            
 *                                                                                                                                                                      
 * {{{                                                                                                                                                                  
 * scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))                                                                                                               
 * broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)                                                                                        
 *                                                                                                                                                                      
 * scala> broadcastVar.value                                                                                                                                            
 * res0: Array[Int] = Array(1, 2, 3)                                                                                                                                    
 * }}}                                                                                                                                                                  
 *                                                                                                                                                                      
 * After the broadcast variable is created, it should be used instead of the value `v` in any                                                                           
 * functions run on the cluster so that `v` is not shipped to the nodes more than once.                                                                                 
 * In addition, the object `v` should not be modified after it is broadcast in order to ensure                                                                          
 * that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped                                                                         
 * to a new node later).                                                                                                                                                
 *                                                                                                                                                                      
 * @param id A unique identifier for the broadcast variable.                                                                                                            
 * @tparam T Type of the data contained in the broadcast variable.                                                                                                      
 */                                                                                                                                                                     
abstract class Broadcast[T: ClassTag](val id: Long) extends Serializable with Logging {    
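The Scaladoc above shows creation and local access; the pattern on a cluster is to reference the wrapper inside closures instead of the original value. A minimal sketch (assuming an existing `SparkContext` named `sc` and an RDD `ids` of integers; the names are illustrative, not from the source):

```scala
// A read-only lookup table we want available on every executor.
val lookup = Map(1 -> "a", 2 -> "b", 3 -> "c")

// Broadcast it once; `bcLookup` is the small serializable wrapper.
val bcLookup = sc.broadcast(lookup)

// Reference the wrapper, not `lookup`, inside closures: the table is then
// shipped to each executor at most once instead of with every task.
val annotated = ids.map(id => (id, bcLookup.value.getOrElse(id, "?")))
```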



/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * An interface for all the broadcast implementations in Spark (to allow                                                                                                
 * multiple broadcast implementations). SparkContext uses a user-specified                                                                                              
 * BroadcastFactory implementation to instantiate a particular broadcast for the                                                                                        
 * entire Spark job.                                                                                                                                                    
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
trait BroadcastFactory {   
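The trait's contract, roughly as it appears in the Spark 1.x source (signatures shown for illustration; check the exact version you are reading):

```scala
trait BroadcastFactory {
  // Called once per SparkContext, on driver and executors.
  def initialize(isDriver: Boolean, conf: SparkConf, securityMgr: SecurityManager): Unit
  // Create a new broadcast variable wrapping `value`.
  def newBroadcast[T: ClassTag](value: T, isLocal: Boolean, id: Long): Broadcast[T]
  // Remove the broadcast's state from executors (and optionally the driver).
  def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit
  def stop(): Unit
}
```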

There are currently two implementations; the default is the latter (TorrentBroadcast).
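Which implementation gets used is controlled by a configuration key (present in Spark 1.x; the HTTP implementation and this knob were removed in later versions):

```
spark.broadcast.factory=org.apache.spark.broadcast.TorrentBroadcastFactory
```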

HttpBroadcast

/**                                                                                                                                                                     
 * A [[org.apache.spark.broadcast.Broadcast]] implementation that uses HTTP server                                                                                      
 * as a broadcast mechanism. The first time a HTTP broadcast variable (sent as part of a                                                                                
 * task) is deserialized in the executor, the broadcasted data is fetched from the driver                                                                               
 * (through a HTTP server running at the driver) and stored in the BlockManager of the                                                                                  
 * executor to speed up future accesses.                                                                                                                                
 */                                                                                                                                                                     
private[spark] class HttpBroadcast[T: ClassTag](     

TorrentBroadcast

/**                                                                                                                                                                     
 * A BitTorrent-like implementation of [[org.apache.spark.broadcast.Broadcast]].                                                                                        
 *                                                                                                                                                                      
 * The mechanism is as follows:                                                                                                                                         
 *                                                                                                                                                                      
 * The driver divides the serialized object into small chunks and                                                                                                       
 * stores those chunks in the BlockManager of the driver.                                                                                                               
 *                                                                                                                                                                      
 * On each executor, the executor first attempts to fetch the object from its BlockManager. If                                                                          
 * it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or                                                                      
 * other executors if available. Once it gets the chunks, it puts the chunks in its own                                                                                 
 * BlockManager, ready for other executors to fetch from.                                                                                                               
 *                                                                                                                                                                      
 * This prevents the driver from being the bottleneck in sending out multiple copies of the                                                                             
 * broadcast data (one per executor) as done by the [[org.apache.spark.broadcast.HttpBroadcast]].                                                                       
 *                                                                                                                                                                      
 * When initialized, TorrentBroadcast objects read SparkEnv.get.conf.                                                                                                   
 *                                                                                                                                                                      
 * @param obj object to broadcast                                                                                                                                       
 * @param id A unique identifier for the broadcast variable.                                                                                                            
 */                                                                                                                                                                     
private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long)                                                                                                    
  extends Broadcast[T](id) with Logging with Serializable {

Randomly picking which remote node to fetch a chunk from is handled by the BlockManager.
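The chunking step on the driver is simple in principle; here is a self-contained sketch of the idea behind `blockifyObject` (simplified: the real code goes through the serializer and compression codec, and takes the block size from `spark.broadcast.blockSize`, 4 MB by default):

```scala
object BlockifyDemo {
  // Split a serialized payload into fixed-size chunks, each of which is
  // stored in the BlockManager as its own block.
  def blockify(bytes: Array[Byte], blockSize: Int): Array[Array[Byte]] =
    bytes.grouped(blockSize).toArray

  // Reassemble the chunks on the receiving side.
  def unblockify(blocks: Array[Array[Byte]]): Array[Byte] =
    blocks.flatten

  def main(args: Array[String]): Unit = {
    val payload = Array.tabulate[Byte](10)(_.toByte)
    val blocks  = blockify(payload, 4)
    assert(blocks.length == 3)                       // chunks of 4, 4, 2 bytes
    assert(unblockify(blocks).sameElements(payload)) // round-trips losslessly
    println(blocks.map(_.length).mkString(","))      // prints 4,4,2
  }
}
```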

### Using Broadcast Join in Spark SQL, with SQL Examples

#### What is a Broadcast Join?

Broadcast Join is an optimization strategy in Spark SQL for joins between a small table and a large table. The core idea is to broadcast the smaller table to every node in the cluster, so that each node can perform the join locally instead of moving large amounts of data through a shuffle.

Spark will attempt a Broadcast Join when either of the following holds:

1. A `BROADCAST` hint is given explicitly, forcing a particular table to be broadcast.
2. The automatic broadcast mechanism kicks in: if a table is smaller than the `spark.sql.autoBroadcastJoinThreshold` setting, broadcasting is triggered automatically.

---

#### How to enable Broadcast Join

Its behavior can be controlled through a configuration parameter:

```sql
SET spark.sql.autoBroadcastJoinThreshold=10m;
```

By default, `autoBroadcastJoinThreshold` is 10 MB (the unit can be KB/MB). Setting it to `-1` disables automatic broadcasting.

---

#### Syntax and examples

Several common ways to use Broadcast Join, with corresponding SQL:

##### Method 1: an explicit BROADCAST hint

Add the `/*+ BROADCAST(table_name) */` hint to the query to tell Spark to broadcast a specific small table.

**SQL example:**

```sql
SELECT /*+ BROADCAST(small_table) */ big_table.*
FROM big_table
JOIN small_table ON big_table.id = small_table.id;
```

This statement broadcasts `small_table` to all executor nodes and joins it locally against `big_table`.

---

##### Method 2: rely on autoBroadcastJoinThreshold implicitly

If you prefer not to intervene manually, you can rely on Spark's automatic decision: just make sure the target table's size is below the threshold.

**SQL example:**

```sql
-- suppose small_table is larger than 10 MB but smaller than the new threshold (e.g. 1g)
SET spark.sql.autoBroadcastJoinThreshold=1g;

SELECT *
FROM big_table
JOIN small_table ON big_table.id = small_table.id;
```

In this scenario, even without any hint, Spark may still choose a Broadcast Join plan.

---

##### Method 3: more complex query shapes

Broadcast optimization is also supported for nested subqueries and multi-level JOINs.

**SQL example:**

```sql
SELECT /*+ BROADCAST(subquery_result) */ main_table.*
FROM (
  SELECT id, value
  FROM intermediate_results
  WHERE condition = true
) AS subquery_result
JOIN main_table ON main_table.id = subquery_result.id;
```

This shows how to broadcast an intermediate result set (a subquery).

---

#### Tuning advice

To get the most out of Broadcast Join, keep the following in mind:

1. **Tune the broadcast threshold sensibly**: adjust `spark.sql.autoBroadcastJoinThreshold` to your actual workload; a value that is too low or too high can hurt performance.
2. **Inspect the execution plan**: use `EXPLAIN` to look at the physical plan and confirm that a Broadcast Hash Join was actually chosen.

```sql
EXPLAIN EXTENDED
SELECT /*+ BROADCAST(small_table) */ big_table.*
FROM big_table
JOIN small_table ON big_table.id = small_table.id;
```

3. **Check the data size**: make sure the table being broadcast really is small enough; otherwise you risk problems such as out-of-memory errors.