Spark Streaming source code walkthrough: updateStateByKey-1

License: CC 4.0 BY-SA

Original link: https://my.oschina.net/corleone/blog/682724


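This article takes a detailed look at updateStateByKey in Spark Streaming. Using a continuously updated word count as the example, it shows how incremental updates are applied while state is kept across batches, and then examines how StateDStream is constructed and how it works.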


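This article focuses on updateStateByKey. It assumes that you can start the spark shell without problems and that you understand how Receivers work and how RDDs are generated; if not, it is best to start with the author's earlier articles on those topics.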

Example: count how often each word occurs, cumulatively, and keep the counts updated as data arrives.

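Because the count is continuous, the efficient approach is to save each word's count after a batch has been processed and, when the next batch arrives, apply an incremental update to the saved counts.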
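As a minimal sketch of that idea, the following uses plain Scala collections only (no Spark). The object name `IncrementalWordCount` and the helper `nextState` are made up for illustration; only the shape of `update` mirrors the function passed to updateStateByKey later in this article.

```scala
// Toy model of "save the counts, then update them incrementally".
// Plain Scala collections, no Spark involved.
object IncrementalWordCount {
  // Same shape as the function given to updateStateByKey:
  // (new values of one key in this batch, previous state) => new state.
  def update(once: Seq[Int], total: Option[Int]): Option[Int] =
    Some(once.sum + total.getOrElse(0))

  // Fold one batch of input lines into the saved per-word counts.
  def nextState(state: Map[String, Int], batch: Seq[String]): Map[String, Int] = {
    // Values of the current batch grouped per key: word -> Seq(1, 1, ...)
    val fresh: Map[String, Seq[Int]] =
      batch.flatMap(_.split(" ")).filter(_.nonEmpty)
        .groupBy(identity)
        .map { case (word, occurrences) => word -> occurrences.map(_ => 1) }

    // Visit every key that has either saved state or new values.
    (state.keySet ++ fresh.keySet).toSeq.flatMap { word =>
      update(fresh.getOrElse(word, Seq.empty), state.get(word)).map(word -> _)
    }.toMap
  }
}
```

Calling `nextState` once per batch and feeding the result back in as `state` gives the running totals.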

First, run the following in a terminal:

root@master:~# nc -lk 9999

Start spark-shell and enter the following code:

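// Count word occurrences cumulatively and keep the counts updated
sc.setCheckpointDir(".") // set the checkpoint directory (required by updateStateByKey)
import org.apache.spark.streaming.{Durations, StreamingContext}
import org.apache.spark.{SparkContext, SparkConf}
// Collect data into a batch every 5 seconds
val ssc = new StreamingContext(sc, Durations.seconds(5L))
// Receive data from the socket on port 9999
ssc.socketTextStream("localhost", 9999).
flatMap(_.split(" ")).
map((_, 1)).
// once: the counts seen for a word in this batch; total: the count saved so far
updateStateByKey((once, total: Option[Int]) => Some(once.sum + total.getOrElse(0))).
print
ssc.start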

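Type one line of data into the nc terminal every 5 seconds. After each batch, the spark-shell prints the cumulative count of every word seen so far.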

With the demo done, let's look at the source code.

Using the method from the earlier article for reconstructing the DAG, the final DStream DAG is:

SocketInputDStream -> FlatMappedDStream -> MappedDStream -> StateDStream -> ForEachDStream

The StateDStream is created by calling MappedDStream.updateStateByKey.

// PairDStreamFunctions.scala line 396 Spark 1.6.0
  def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S]
    ): DStream[(K, S)] = ssc.withScope {
    updateStateByKey(updateFunc, defaultPartitioner())
  }
// PairDStreamFunctions.scala line 428 
  def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S],
      partitioner: Partitioner
    ): DStream[(K, S)] = ssc.withScope {
    val cleanedUpdateF = sparkContext.clean(updateFunc)
    val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
      iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
    } 
    updateStateByKey(newUpdateFunc, partitioner, true)
  }

The last overload simply instantiates a StateDStream:

// PairDStreamFunctions.scala line 452
  def updateStateByKey[S: ClassTag](
      updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
      partitioner: Partitioner,
      rememberPartitioner: Boolean
    ): DStream[(K, S)] = ssc.withScope {
     new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
  }

Careful readers will notice that neither MappedDStream nor any of its parent classes defines an updateStateByKey method.

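The first thing that comes to mind is an implicit conversion: a DStream[(K, V)] is implicitly wrapped in PairDStreamFunctions, which is where the method actually lives.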

// PairDStreamFunctions.scala line 37
class PairDStreamFunctions[K, V](self: DStream[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K])
  extends Serializable
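The pattern can be sketched without Spark. In the example below, `Stream`, `PairStreamFunctions`, `sumByKey` and `toPairStreamFunctions` are hypothetical stand-ins invented for illustration; they are not Spark classes. In Spark 1.6 the corresponding conversion should, as far as I know, be `toPairDStreamFunctions` in the DStream companion object.

```scala
import scala.language.implicitConversions

object ImplicitDemo {
  // Hypothetical stand-in for DStream: it has no sumByKey method of its own.
  class Stream[T](val items: Seq[T])

  // Hypothetical stand-in for PairDStreamFunctions:
  // extra methods that only make sense for streams of key-value pairs.
  class PairStreamFunctions[K, V](self: Stream[(K, V)]) {
    def sumByKey(implicit num: Numeric[V]): Map[K, V] =
      self.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
  }

  // The compiler inserts this conversion when sumByKey is called
  // on a Stream[(K, V)], which does not define that method itself.
  implicit def toPairStreamFunctions[K, V](s: Stream[(K, V)]): PairStreamFunctions[K, V] =
    new PairStreamFunctions(s)

  // Looks like a method call on Stream, but resolves through the conversion.
  val counts: Map[String, Int] =
    new Stream(Seq("a" -> 1, "b" -> 1, "a" -> 1)).sumByKey
}
```

The call site never mentions the wrapper class, which is why the method seems to appear on MappedDStream out of nowhere.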

Careful readers may also wonder why the function that was passed in,

(once, total: Option[Int]) => Some(once.sum + total.getOrElse(0))

is wrapped once more during the call, into

// PairDStreamFunctions.scala line 433
    val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
      iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
    }

Why? That stays a cliffhanger for now. Impatient readers can look for the answer themselves and leave me a comment.
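Without answering the "why", it may help to see what the wrapper does mechanically. The sketch below reproduces the same flatMap on plain iterators; `LiftUpdateFunc`, `perKey` and `lift` are names made up for this illustration, and the rule that a non-positive sum drops the key is an assumption chosen to show the None case.

```scala
object LiftUpdateFunc {
  // Per-key update function. Returning None removes the key from the state.
  def perKey(once: Seq[Int], total: Option[Int]): Option[Int] = {
    val sum = once.sum + total.getOrElse(0)
    if (sum > 0) Some(sum) else None
  }

  // Same wrapping as newUpdateFunc: a per-key function becomes a function
  // over an iterator of (key, new values, old state) triples.
  def lift[K, V, S](f: (Seq[V], Option[S]) => Option[S])
      : Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)] =
    iterator => iterator.flatMap(t => f(t._2, t._3).map(s => (t._1, s)))

  val lifted = lift[String, Int, Int](perKey)

  val input: List[(String, Seq[Int], Option[Int])] = List(
    ("a", Seq(1, 1), Some(3)), // existing key with new values
    ("b", Seq(), None),        // nothing to count: perKey returns None
    ("c", Seq(1), None)        // brand new key
  )

  // "b" disappears because flatMap discards the None result.
  val output: List[(String, Int)] = lifted(input.iterator).toList
}
```

The lifted function keeps the key next to each new state and silently drops keys whose update returned None.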

Next, look at how StateDStream is constructed. Unlike most other DStreams, its mustCheckpoint is true: the state has to be preserved, so checkpointing is required.

// StateDStream.scala line 43
  override val mustCheckpoint = true

Also unlike other DStreams, besides compute it has a computeUsingPreviousRDD method.

// StateDStream.scala line 64
override def compute(validTime: Time): Option[RDD[(K, S)]]

// StateDStream.scala line 45
private [this] def computeUsingPreviousRDD

This is the second cliffhanger.

As shown in the author's earlier source analysis, job generation starts at outputStreams and walks backwards through the dependencies.

In this example the walk starts at ForEachDStream. The point where it reaches StateDStream.compute(validTime) is the focus of this article.

Before diving into the code, here is the general idea:

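Along the DAG dimension, the current StateDStream depends on its parent DStream to create an RDD.
Along the time dimension, when the DStream for the current batch (here, the StateDStream) creates an RDD, it needs both the RDD that the same RDD template (the StateDStream) created for the previous batch, which holds the historical state, and the RDD created for the current batch; the two are then aggregated by key again. The RDD of the previous batch in turn depended on the RDD of the batch before it, so this is really a recursion.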

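The result of the current batch therefore always consists of two parts: the RDD holding the historical state and the RDD created for the current batch.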

The recursive dependency along the time dimension bottoms out on the first batch, where it returns nothing: there is no previous batch yet.

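Of course, once the StateDStream has created an RDD, that RDD is stored in the DStream's own data structure (generatedRDDs), so the next lookup fetches it directly and does not have to recurse backwards in time. RDDs keep being created as time passes, though, and obviously cannot be kept forever. When are they cleaned up? Keep that question in mind.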

// DStream.scala line 366
generatedRDDs.put(time, newRDD) // already stored by an earlier put
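The interplay of recursion in time and caching can be modelled in a few lines of plain Scala. This is a simplified sketch, not the Spark implementation: times are plain Long values, an "RDD" is a Map of counts, and `StateOverTime`, `batches`, `generated` and `computeCalls` are hypothetical names. Only the roles of getOrCompute, compute, isTimeValid, zeroTime and generatedRDDs are taken from the text.

```scala
import scala.collection.mutable

object StateOverTime {
  type State = Map[String, Int]

  val zeroTime = 0L
  val slide = 5L

  // Hypothetical input: the words received in the batch ending at each time.
  val batches: Map[Long, Seq[String]] =
    Map(5L -> Seq("a", "b"), 10L -> Seq("a"), 15L -> Seq("b", "b"))

  // Plays the role of generatedRDDs: results already computed, keyed by time.
  val generated = mutable.HashMap.empty[Long, State]
  var computeCalls = 0

  // Valid times are strictly after zeroTime and aligned to the slide duration.
  def isTimeValid(time: Long): Boolean =
    time > zeroTime && (time - zeroTime) % slide == 0

  // Return the cached result if present, otherwise compute and cache it.
  def getOrCompute(time: Long): Option[State] =
    generated.get(time).orElse {
      if (isTimeValid(time)) {
        val result = compute(time)
        result.foreach(generated.put(time, _))
        result
      } else None
    }

  // The state at `time` is the state one slide earlier plus the current batch.
  def compute(time: Long): Option[State] = {
    computeCalls += 1
    val prev = getOrCompute(time - slide).getOrElse(Map.empty[String, Int])
    val fresh = batches.getOrElse(time, Seq.empty)
    Some(fresh.foldLeft(prev)((st, w) => st.updated(w, st.getOrElse(w, 0) + 1)))
  }
}
```

Asking for time 15 on an empty cache recurses through 10 and 5 and stops at zeroTime; asking again is a pure cache hit. The real StateDStream.compute distinguishes more cases (missing parent RDD, initial state), which this sketch leaves out.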

In this example, the state is each word together with its count.

The analysis below walks through the branches, with the relevant code inserted at each one.

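getOrCompute(validTime - slideDuration): the current batch time minus the slide duration, i.e. fetch the RDD that this StateDStream created for the previous batch.
- None: first run. The requested time equals zeroTime, so None is returned. In other words, along the time dimension no historical state exists yet.
```
// DStream.scala line 341
generatedRDDs.get(time).orElse { // nothing is cached on the first run, so the block executes
    if(isTimeValid(time)){ // on the first run the requested time equals zeroTime, so this is false
    // some logic
    }else{
        None
    }
}

// DStream.scala line 321
  private[streaming] def isTimeValid(time: Time): Boolean = {
    if (!isInitialized) {
      throw new SparkException (this + " has not been initialized")
    } else if (time <= zeroTime || ! (time - zeroTime).isMultipleOf(slideDuration)) {
// on the first trigger time = zeroTime, so false is returned
      logInfo("Time " + time + " is invalid as zeroTime is " + zeroTime +
        " and slideDuration is " + slideDuration + " and difference is " + (time - zeroTime))
      false
    } else {
      logDebug("Time " + time + " is valid")
      true
    }
  }
```
  - parent.getOrCompute(validTime): fetch the RDD of the parent DStream this one depends on. This brings us back to the dimension of the current batch.
    - Some(parentRDD): a parent RDD exists, so there is new data to process.
      - initialRDD: next, check whether an initial RDD was supplied, since the state does not always start from zero.
        None: no initial state, so the state RDD is created directly from the parent RDD.
        Some(initialStateRDD): an initial state exists.
        computeUsingPreviousRDD (parentRDD, initialStateRDD): cogroup the parent RDD with the initial RDD.
    - None: no parent RDD, do nothing.
- Some(prevStateRDD): an RDD created for the previous batch exists and is fetched directly.
```
// DStream.scala line 366
generatedRDDs.put(time, newRDD) // already stored by an earlier put
```
  - parent.getOrCompute(validTime): fetch the RDD of the parent DStream this one depends on. This brings us back to the dimension of the current batch.
    - Some(parentRDD): a parent RDD exists.
      - computeUsingPreviousRDD (parentRDD, prevStateRDD): cogroup the parent RDD with the previous state RDD.
    - None: no new data has arrived.
      - prevStateRDD.mapPartitions(finalFunc, preservePartitioning): re-apply the update function to the previous batch's state.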

Now look at the code as a whole; it should be quite clear.

// StateDStream.scala line 64
  override def compute(validTime: Time): Option[RDD[(K, S)]] = {

    // Try to get the previous state RDD
    getOrCompute(validTime - slideDuration) match {// line 67

      case Some(prevStateRDD) => {    // If previous state RDD exists ,line 69

        // Try to get the parent RDD
        parent.getOrCompute(validTime) match {
          case Some(parentRDD) => {   // If parent RDD exists, then compute as usual
            computeUsingPreviousRDD (parentRDD, prevStateRDD)
          }
          case None => {    // If parent RDD does not exist, line 76

            // Re-apply the update function to the old state RDD
            val updateFuncLocal = updateFunc
            val finalFunc = (iterator: Iterator[(K, S)]) => {
              val i = iterator.map(t => (t._1, Seq[V](), Option(t._2)))
              updateFuncLocal(i)
            }
            val stateRDD = prevStateRDD.mapPartitions(finalFunc, preservePartitioning)
            Some(stateRDD)
          }
        }
      }

      case None => {    // If previous session RDD does not exist (first input data), line 90

        // Try to get the parent RDD
        parent.getOrCompute(validTime) match {
          case Some(parentRDD) => {   // If parent RDD exists, then compute as usual
            initialRDD match {
              case None => {
                // Define the function for the mapPartition operation on grouped RDD;
                // first map the grouped tuple to tuples of required type,
                // and then apply the update function
                val updateFuncLocal = updateFunc
                val finalFunc = (iterator : Iterator[(K, Iterable[V])]) => {
                  updateFuncLocal (iterator.map (tuple => (tuple._1, tuple._2.toSeq, None)))
                }

                val groupedRDD = parentRDD.groupByKey (partitioner)
                val sessionRDD = groupedRDD.mapPartitions (finalFunc, preservePartitioning)
                // logDebug("Generating state RDD for time " + validTime + " (first)")
                Some (sessionRDD)
              }
              case Some (initialStateRDD) => {
                computeUsingPreviousRDD(parentRDD, initialStateRDD)
              }
            }
          }
          case None => { // If parent RDD does not exist, then nothing to do!
            // logDebug("Not generating state RDD (no previous state, no parent)")
            None
          }
        }
      }
    }
  }

Now for the computeUsingPreviousRDD method: it simply cogroups the two RDDs into one.

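State management always merges the historical state (all earlier batches) with the current state (the current batch), which adds a time dimension. An RDD lineage has no notion of time, however: walking back along the lineage only yields RDDs. The time dimension therefore has to be collapsed, merging both inputs into a single, time-aligned state RDD.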

// StateDStream.scala line 45
  private [this] def computeUsingPreviousRDD (
    parentRDD : RDD[(K, V)], prevStateRDD : RDD[(K, S)]) = {
    // Define the function for the mapPartition operation on cogrouped RDD;
    // first map the cogrouped tuple to tuples of required type,
    // and then apply the update function
    val updateFuncLocal = updateFunc
    val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
      val i = iterator.map(t => {
        val itr = t._2._2.iterator
        val headOption = if (itr.hasNext) Some(itr.next()) else None
        (t._1, t._2._1.toSeq, headOption)
      })
      updateFuncLocal(i)
    }
    val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
    val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
    Some(stateRDD)
  }

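cogroup pulls out every key; it does not iterate over only the incremental part. So unless you really need the full state returned on every batch, updateStateByKey is not recommended. Use mapWithState instead; its usage is covered in the next article.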
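The cost can be made visible with a plain-collection model. This is only a sketch under stated assumptions: `CogroupCost`, its `cogroup` helper and the `updateCalls` counter are hypothetical, and the helper imitates the semantics of a cogroup (one entry per key present on either side) rather than calling Spark.

```scala
object CogroupCost {
  // Plain-collection stand-in for cogroup:
  // one entry per key that appears in the new data or in the saved state.
  def cogroup[K, V, S](fresh: Seq[(K, V)], state: Map[K, S]): Map[K, (Seq[V], Option[S])] = {
    val grouped = fresh.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    (grouped.keySet ++ state.keySet).map { k =>
      k -> ((grouped.getOrElse(k, Seq.empty[V]), state.get(k)))
    }.toMap
  }

  // Counts how often the user-supplied update function runs.
  var updateCalls = 0
  def update(once: Seq[Int], total: Option[Int]): Option[Int] = {
    updateCalls += 1
    Some(once.sum + total.getOrElse(0))
  }

  // One batch: cogroup new data with the old state, then update every key.
  def step(fresh: Seq[(String, Int)], state: Map[String, Int]): Map[String, Int] =
    cogroup(fresh, state).toSeq.flatMap { case (k, (vs, s)) =>
      update(vs, s).map(k -> _)
    }.toMap
}
```

With 1000 saved keys and a batch that touches a single word, `update` still runs 1000 times, which is the full scan the paragraph above warns about.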

The next article analyzes mapWithState.
