spark-broadcast

This article introduces broadcast variables in Apache Spark and how they are implemented. Broadcast variables let the programmer cache a read-only dataset on each node instead of shipping a copy of it with every task. The article explains how to create and use broadcast variables, and compares the two implementations: the HTTP-based broadcast mechanism and the BitTorrent-like TorrentBroadcast.


@(spark)[broadcast]
Spark's broadcast variables are used to broadcast immutable datasets to all nodes.

Broadcast

/**                                                                                                                                                                     
 * A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable                                                                          
 * cached on each machine rather than shipping a copy of it with tasks. They can be used, for                                                                           
 * example, to give every node a copy of a large input dataset in an efficient manner. Spark also                                                                       
 * attempts to distribute broadcast variables using efficient broadcast algorithms to reduce                                                                            
 * communication cost.                                                                                                                                                  
 *                                                                                                                                                                      
 * Broadcast variables are created from a variable `v` by calling                                                                                                       
 * [[org.apache.spark.SparkContext#broadcast]].                                                                                                                         
 * The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the                                                                         
 * `value` method. The interpreter session below shows this:                                                                                                            
 *                                                                                                                                                                      
 * {{{                                                                                                                                                                  
 * scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))                                                                                                               
 * broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)                                                                                        
 *                                                                                                                                                                      
 * scala> broadcastVar.value                                                                                                                                            
 * res0: Array[Int] = Array(1, 2, 3)                                                                                                                                    
 * }}}                                                                                                                                                                  
 *                                                                                                                                                                      
 * After the broadcast variable is created, it should be used instead of the value `v` in any                                                                           
 * functions run on the cluster so that `v` is not shipped to the nodes more than once.                                                                                 
 * In addition, the object `v` should not be modified after it is broadcast in order to ensure                                                                          
 * that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped                                                                         
 * to a new node later).                                                                                                                                                
 *                                                                                                                                                                      
 * @param id A unique identifier for the broadcast variable.                                                                                                            
 * @tparam T Type of the data contained in the broadcast variable.                                                                                                      
 */                                                                                                                                                                     
abstract class Broadcast[T: ClassTag](val id: Long) extends Serializable with Logging {    
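The Scaladoc above shows creation and local access; the pattern on a cluster is to reference the wrapper inside closures instead of the original value. A minimal sketch (assuming an existing `SparkContext` named `sc` and an RDD `ids` of integers; the names are illustrative, not from the source):

```scala
// A read-only lookup table we want available on every executor.
val lookup = Map(1 -> "a", 2 -> "b", 3 -> "c")

// Broadcast it once; `bcLookup` is the small serializable wrapper.
val bcLookup = sc.broadcast(lookup)

// Reference the wrapper, not `lookup`, inside closures: the table is then
// shipped to each executor at most once instead of with every task.
val annotated = ids.map(id => (id, bcLookup.value.getOrElse(id, "?")))
```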



/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * An interface for all the broadcast implementations in Spark (to allow                                                                                                
 * multiple broadcast implementations). SparkContext uses a user-specified                                                                                              
 * BroadcastFactory implementation to instantiate a particular broadcast for the                                                                                        
 * entire Spark job.                                                                                                                                                    
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
trait BroadcastFactory {   
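The trait's contract, roughly as it appears in the Spark 1.x source (signatures shown for illustration; check the exact version you are reading):

```scala
trait BroadcastFactory {
  // Called once per SparkContext, on driver and executors.
  def initialize(isDriver: Boolean, conf: SparkConf, securityMgr: SecurityManager): Unit
  // Create a new broadcast variable wrapping `value`.
  def newBroadcast[T: ClassTag](value: T, isLocal: Boolean, id: Long): Broadcast[T]
  // Remove the broadcast's state from executors (and optionally the driver).
  def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit
  def stop(): Unit
}
```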

There are currently two implementations; the default is the latter (TorrentBroadcast).
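Which implementation gets used is controlled by a configuration key (present in Spark 1.x; the HTTP implementation and this knob were removed in later versions):

```
spark.broadcast.factory=org.apache.spark.broadcast.TorrentBroadcastFactory
```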

HttpBroadcast

/**                                                                                                                                                                     
 * A [[org.apache.spark.broadcast.Broadcast]] implementation that uses HTTP server                                                                                      
 * as a broadcast mechanism. The first time a HTTP broadcast variable (sent as part of a                                                                                
 * task) is deserialized in the executor, the broadcasted data is fetched from the driver                                                                               
 * (through a HTTP server running at the driver) and stored in the BlockManager of the                                                                                  
 * executor to speed up future accesses.                                                                                                                                
 */                                                                                                                                                                     
private[spark] class HttpBroadcast[T: ClassTag](     

TorrentBroadcast

/**                                                                                                                                                                     
 * A BitTorrent-like implementation of [[org.apache.spark.broadcast.Broadcast]].                                                                                        
 *                                                                                                                                                                      
 * The mechanism is as follows:                                                                                                                                         
 *                                                                                                                                                                      
 * The driver divides the serialized object into small chunks and                                                                                                       
 * stores those chunks in the BlockManager of the driver.                                                                                                               
 *                                                                                                                                                                      
 * On each executor, the executor first attempts to fetch the object from its BlockManager. If                                                                          
 * it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or                                                                      
 * other executors if available. Once it gets the chunks, it puts the chunks in its own                                                                                 
 * BlockManager, ready for other executors to fetch from.                                                                                                               
 *                                                                                                                                                                      
 * This prevents the driver from being the bottleneck in sending out multiple copies of the                                                                             
 * broadcast data (one per executor) as done by the [[org.apache.spark.broadcast.HttpBroadcast]].                                                                       
 *                                                                                                                                                                      
 * When initialized, TorrentBroadcast objects read SparkEnv.get.conf.                                                                                                   
 *                                                                                                                                                                      
 * @param obj object to broadcast                                                                                                                                       
 * @param id A unique identifier for the broadcast variable.                                                                                                            
 */                                                                                                                                                                     
private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long)                                                                                                    
  extends Broadcast[T](id) with Logging with Serializable {

Randomly picking which remote node to fetch a chunk from is handled by the BlockManager.
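The chunking step on the driver is simple in principle; here is a self-contained sketch of the idea behind `blockifyObject` (simplified: the real code goes through the serializer and compression codec, and takes the block size from `spark.broadcast.blockSize`, 4 MB by default):

```scala
object BlockifyDemo {
  // Split a serialized payload into fixed-size chunks, each of which is
  // stored in the BlockManager as its own block.
  def blockify(bytes: Array[Byte], blockSize: Int): Array[Array[Byte]] =
    bytes.grouped(blockSize).toArray

  // Reassemble the chunks on the receiving side.
  def unblockify(blocks: Array[Array[Byte]]): Array[Byte] =
    blocks.flatten

  def main(args: Array[String]): Unit = {
    val payload = Array.tabulate[Byte](10)(_.toByte)
    val blocks  = blockify(payload, 4)
    assert(blocks.length == 3)                       // chunks of 4, 4, 2 bytes
    assert(unblockify(blocks).sameElements(payload)) // round-trips losslessly
    println(blocks.map(_.length).mkString(","))      // prints 4,4,2
  }
}
```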

### Using Broadcast Join in Spark SQL, with SQL Examples

#### What is a Broadcast Join?

Broadcast Join is an optimization strategy in Spark SQL for joins between a small table and a large table. The core idea is to broadcast the smaller table to every node in the cluster, so that each node can perform the join locally instead of moving large amounts of data through a shuffle.

Spark will attempt a Broadcast Join when either of the following holds:

1. A `BROADCAST` hint is given explicitly, forcing a particular table to be broadcast.
2. The automatic broadcast mechanism kicks in: if a table is smaller than the `spark.sql.autoBroadcastJoinThreshold` setting, broadcasting is triggered automatically.

---

#### How to enable Broadcast Join

Its behavior can be controlled through a configuration parameter:

```sql
SET spark.sql.autoBroadcastJoinThreshold=10m;
```

By default, `autoBroadcastJoinThreshold` is 10 MB (the unit can be KB/MB). Setting it to `-1` disables automatic broadcasting.

---

#### Syntax and examples

Several common ways to use Broadcast Join, with corresponding SQL:

##### Method 1: an explicit BROADCAST hint

Add the `/*+ BROADCAST(table_name) */` hint to the query to tell Spark to broadcast a specific small table.

**SQL example:**

```sql
SELECT /*+ BROADCAST(small_table) */ big_table.*
FROM big_table
JOIN small_table ON big_table.id = small_table.id;
```

This statement broadcasts `small_table` to all executor nodes and joins it locally against `big_table`.

---

##### Method 2: rely on autoBroadcastJoinThreshold implicitly

If you prefer not to intervene manually, you can rely on Spark's automatic decision: just make sure the target table's size is below the threshold.

**SQL example:**

```sql
-- suppose small_table is larger than 10 MB but smaller than the new threshold (e.g. 1g)
SET spark.sql.autoBroadcastJoinThreshold=1g;

SELECT *
FROM big_table
JOIN small_table ON big_table.id = small_table.id;
```

In this scenario, even without any hint, Spark may still choose a Broadcast Join plan.

---

##### Method 3: more complex query shapes

Broadcast optimization is also supported for nested subqueries and multi-level JOINs.

**SQL example:**

```sql
SELECT /*+ BROADCAST(subquery_result) */ main_table.*
FROM (
  SELECT id, value
  FROM intermediate_results
  WHERE condition = true
) AS subquery_result
JOIN main_table ON main_table.id = subquery_result.id;
```

This shows how to broadcast an intermediate result set (a subquery).

---

#### Tuning advice

To get the most out of Broadcast Join, keep the following in mind:

1. **Tune the broadcast threshold sensibly**: adjust `spark.sql.autoBroadcastJoinThreshold` to your actual workload; a value that is too low or too high can hurt performance.
2. **Inspect the execution plan**: use `EXPLAIN` to look at the physical plan and confirm that a Broadcast Hash Join was actually chosen.

```sql
EXPLAIN EXTENDED
SELECT /*+ BROADCAST(small_table) */ big_table.*
FROM big_table
JOIN small_table ON big_table.id = small_table.id;
```

3. **Check the data size**: make sure the table being broadcast really is small enough; otherwise you risk problems such as out-of-memory errors.