Spark 2.0 Source Code, Part 1: Broadcast

This post takes a close look at Spark's broadcast variables and how they are implemented. It covers creating broadcast variables through SparkContext, their role in efficiently shipping large datasets across a cluster, their usage restrictions, their lifecycle management, and the part they play in guaranteeing consistent results.


The goal is to understand Spark's internals by reading the comments in the Spark source code itself.


```scala
package org.apache.spark.broadcast

import java.io.Serializable

import scala.reflect.ClassTag

import org.apache.spark.SparkException
import org.apache.spark.internal.Logging
import org.apache.spark.util.Utils

/**
 * Annotation: a broadcast variable lets the developer cache one read-only copy of a
 * variable on each machine, instead of shipping a copy with every task. For example,
 * broadcasting is an efficient way to give every node in the cluster a copy of a large
 * dataset, and Spark uses efficient broadcast algorithms to reduce the communication cost.
 * A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable
 * cached on each machine rather than shipping a copy of it with tasks. They can be used, for
 * example, to give every node a copy of a large input dataset in an efficient manner. Spark also
 * attempts to distribute broadcast variables using efficient broadcast algorithms to reduce
 * communication cost.
 *
 * Annotation: a variable is broadcast simply by calling SparkContext.broadcast on it.
 * Broadcast variables are created from a variable `v` by calling
 * [[org.apache.spark.SparkContext#broadcast]].
 * The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the
 * `value` method. The interpreter session below shows this:
 *
 * {{{
 * scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
 * broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
 *
 * scala> broadcastVar.value
 * res0: Array[Int] = Array(1, 2, 3)
 * }}}
 *
 * Annotation: once a broadcast variable has been shipped to the nodes, its value must not
 * be modified, so that every node computes consistent results when reading it.
 * After the broadcast variable is created, it should be used instead of the value `v` in any
 * functions run on the cluster so that `v` is not shipped to the nodes more than once.
 * In addition, the object `v` should not be modified after it is broadcast in order to ensure
 * that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped
 * to a new node later).
 *
 * @param id A unique identifier for the broadcast variable.
 * @tparam T Type of the data contained in the broadcast variable.
 */
abstract class Broadcast[T: ClassTag](val id: Long) extends Serializable with Logging {

  /**
   * Flag signifying whether the broadcast variable is valid
   * (that is, not already destroyed) or not.
   */
  @volatile private var _isValid = true

  private var _destroySite = ""

  /** Get the broadcasted value. */
  def value: T = {
    assertValid()
    getValue()
  }

  /**
   * Annotation: asynchronously deletes the cached copy of this broadcast on every executor.
   * Asynchronously delete cached copies of this broadcast on the executors.
   * If the broadcast is used after this is called, it will need to be re-sent to each executor.
   */
  def unpersist() {
    unpersist(blocking = false)
  }

  /**
   * Delete cached copies of this broadcast on the executors. If the broadcast is used after
   * this is called, it will need to be re-sent to each executor.
   * @param blocking Whether to block until unpersisting has completed
   */
  def unpersist(blocking: Boolean) {
    assertValid()
    doUnpersist(blocking)
  }


  /**
   * Annotation: destroys all data and metadata of this broadcast variable. Use with caution:
   * once destroyed, the broadcast variable cannot be used again.
   * Destroy all data and metadata related to this broadcast variable. Use this with caution;
   * once a broadcast variable has been destroyed, it cannot be used again.
   * This method blocks until destroy has completed
   */
  def destroy() {
    destroy(blocking = true)
  }

  /**
   * Destroy all data and metadata related to this broadcast variable. Use this with caution;
   * once a broadcast variable has been destroyed, it cannot be used again.
   * @param blocking Whether to block until destroy has completed
   */
  private[spark] def destroy(blocking: Boolean) {
    assertValid()
    _isValid = false
    _destroySite = Utils.getCallSite().shortForm
    logInfo("Destroying %s (from %s)".format(toString, _destroySite))
    doDestroy(blocking)
  }

  /**
   * Whether this Broadcast is actually usable. This should be false once persisted state is
   * removed from the driver.
   */
  private[spark] def isValid: Boolean = {
    _isValid
  }

  /**
   * Actually get the broadcasted value. Concrete implementations of Broadcast class must
   * define their own way to get the value.
   */
  protected def getValue(): T

  /**
   * Actually unpersist the broadcasted value on the executors. Concrete implementations of
   * Broadcast class must define their own logic to unpersist their own data.
   */
  protected def doUnpersist(blocking: Boolean)

  /**
   * Actually destroy all data and metadata related to this broadcast variable.
   * Implementation of Broadcast class must define their own logic to destroy their own
   * state.
   */
  protected def doDestroy(blocking: Boolean)

  /** Check if this broadcast is valid. If not valid, exception is thrown. */
  protected def assertValid() {
    if (!_isValid) {
      throw new SparkException(
        "Attempted to use %s after it was destroyed (%s) ".format(toString, _destroySite))
    }
  }

  override def toString: String = "Broadcast(" + id + ")"
}
```
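To see the class above in action, here is a minimal usage sketch of the broadcast lifecycle. It assumes a live `SparkContext` named `sc`; the data and variable names are illustrative:

```scala
// Minimal sketch of the Broadcast lifecycle, assuming a live SparkContext `sc`.
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b", 3 -> "c"))

// Tasks read the cached copy via `value` instead of capturing the Map directly,
// so the data is shipped to each executor at most once.
val tagged = sc.parallelize(Seq(1, 2, 3)).map(k => lookup.value.getOrElse(k, "?"))
println(tagged.collect().mkString(", "))  // a, b, c

// Asynchronously drop the cached copies on the executors; the data is
// re-sent to an executor if the broadcast is used again afterwards.
lookup.unpersist(blocking = false)

// destroy() removes all data and metadata and blocks until done;
// any later call to `value` throws a SparkException via assertValid().
lookup.destroy()
```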

### Spark SQL Broadcast Join: usage and examples

#### What is a Broadcast Join?

A Broadcast Join is a Spark SQL optimization for joins in which one table is small. When a table is small enough, it is broadcast to every node in the cluster instead of being distributed through a shuffle.

#### Key configuration parameters

The parameters most relevant to Broadcast Join are:

- **`spark.sql.autoBroadcastJoinThreshold`**: the size threshold (in bytes) below which a table is automatically broadcast. The default is 10 MB (`10485760`); setting it to `-1` disables automatic broadcasting.
- **`spark.sql.broadcastTimeout`**: the maximum time (in seconds) to wait for the broadcast to complete before an exception is thrown. The default is 300 seconds.
- **`spark.sql.adaptive.autoBroadcastJoinThreshold`**: the broadcast threshold under adaptive query execution, which lets Spark adjust the broadcast decision dynamically and can further improve performance in complex scenarios.

#### Example

The following example shows both how to request a Broadcast Join manually and how to let Spark decide automatically:

```scala
import org.apache.spark.sql.functions._

// Two DataFrames: a large table A and a small table B
val dfA = spark.read.format("csv").option("header", "true").load("/path/to/large_table.csv")
val dfB = spark.read.format("csv").option("header", "true").load("/path/to/small_table.csv")

// Explicitly hint that dfB should be broadcast
dfA.join(broadcast(dfB), Seq("key"), "inner")
  .show()

// Or rely entirely on Spark's own judgment by configuring the automatic threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760") // in bytes
```

In this example, `dfB` is assumed to be well below the configured broadcast limit, so it is broadcast to every executor to accelerate the join, while the larger table keeps its normal path through the join.

Note that not all join types allow either side to be broadcast. In a left outer join (`LEFT OUTER JOIN`), for example, only the right-hand relation is a candidate for broadcasting.

Finally, `spark.sql.shuffle.partitions` determines how many partitions a shuffle produces, which indirectly affects how efficiently the final result set is generated for joins that do fall back to a shuffle.
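As a quick check that the optimizer actually chose a broadcast join, you can inspect the physical plan; in recent Spark versions the same request can also be expressed as a SQL hint. A minimal sketch, reusing the hypothetical `dfA`/`dfB` from above:

```scala
// Register both DataFrames as temp views so they can be referenced from SQL
dfA.createOrReplaceTempView("large_a")
dfB.createOrReplaceTempView("small_b")

// The BROADCAST hint asks the planner to broadcast small_b, like broadcast(dfB)
val joined = spark.sql(
  """SELECT /*+ BROADCAST(small_b) */ a.*
    |FROM large_a a
    |JOIN small_b b ON a.key = b.key""".stripMargin)

// If the hint was honored, the physical plan contains a BroadcastHashJoin node
joined.explain()
```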