Trying out stateful transformations in Spark Streaming: updateStateByKey and mapWithState

Article link: https://blog.youkuaiyun.com/jason_9527/article/details/106387366

Streaming wordCount example

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf

object StreamWordCount {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lineStreams = ssc.socketTextStream("localhost", 9999)

    val wordStreams = lineStreams.flatMap(_.split(" "))
    val wordAndOneStreams = wordStreams.map((_, 1))
    val wordAndCountStreams = wordAndOneStreams.reduceByKey(_+_)

    wordAndCountStreams.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

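The approach above only counts the words in the current batch.

In practice, however, the requirement is usually a running total that is accumulated across batches.

In Spark Streaming, DStream operations fall into two groups: stateless transformations and stateful transformations.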
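
To make the difference concrete, here is a minimal plain-Scala sketch (no Spark involved) that treats each batch as a `Seq[String]` of lines. The object name `BatchVsRunningCount`, the helpers `countBatch` and `runningTotals`, and the sample input are all made up for illustration. It only models the idea and is not how Spark implements it.

```scala
object BatchVsRunningCount {

  // Count words within a single batch only (what a per-batch reduceByKey yields).
  def countBatch(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" ")).filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // Carry totals across batches: each step merges the batch counts
  // into the state left behind by the previous batches.
  def runningTotals(batches: Seq[Seq[String]]): Seq[Map[String, Int]] =
    batches.scanLeft(Map.empty[String, Int]) { (state, lines) =>
      countBatch(lines).foldLeft(state) { case (acc, (word, n)) =>
        acc.updated(word, acc.getOrElse(word, 0) + n)
      }
    }.tail // drop the initial empty state

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a b a"), Seq("b c")) // hypothetical input
    println(batches.map(countBatch))   // per-batch counts
    println(runningTotals(batches))    // accumulated counts
  }
}
```

With the sample input, the per-batch view counts `b` as 1 in each batch, while the accumulated view reports 2 for `b` after the second batch.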

Stateless transformations

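A stateless transformation applies a simple RDD transformation to each batch, that is, it transforms every RDD in the DStream.

The operators are essentially the same as those in Spark Core. The common ones are listed below.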

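Transformation	Meaning
map(func)	Returns a new DStream by passing each element of the source DStream through the function func.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items.
filter(func)	Returns a new DStream containing only the elements of the source DStream for which func returns true.
repartition(numPartitions)	Changes the level of parallelism of this DStream by creating more or fewer partitions.
union(otherStream)	Returns a new DStream containing all elements of the source DStream and otherStream.
count()	Returns a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func)	Returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream with the function func. The function should be associative and commutative so that it can be computed in parallel.
countByValue()	When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs in which the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks])	When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs in which the values for each key are aggregated using the given reduce function. Note: by default, the grouping uses Spark's default number of parallel tasks; you can pass the optional numTasks argument to set a different number of tasks.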
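
All of these operators work batch by batch. As a rough mental model only (plain Scala collections standing in for a DStream; this is not the Spark API), a stateless transformation is a function applied independently to every batch. The names `StatelessModel`, `Batches` and `longWords` are hypothetical.

```scala
object StatelessModel {

  // A "stream" is modelled as a sequence of batches;
  // each batch is a sequence of elements (standing in for one RDD).
  type Batches[A] = Seq[Seq[A]]

  // flatMap followed by filter, applied to every batch independently.
  def longWords(stream: Batches[String], minLen: Int): Batches[String] =
    stream.map(batch => batch.flatMap(_.split(" ")).filter(_.length >= minLen))

  // Analogue of countByValue: per-batch frequency of each distinct element.
  def countByValue[A](stream: Batches[A]): Seq[Map[A, Long]] =
    stream.map { batch =>
      batch.groupBy(identity).map { case (k, vs) => (k, vs.size.toLong) }
    }

  // Analogue of count: one number per batch.
  def count[A](stream: Batches[A]): Seq[Long] =
    stream.map(_.size.toLong)
}
```

Nothing computed for one batch is visible to the next, which is exactly what makes these operators stateless.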

Stateful transformations

updateStateByKey operation

The following is adapted from the description in the official documentation:

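The updateStateByKey operation lets you maintain arbitrary state while continuously updating it with new information. Using it takes two steps.

Define the state - the state can be of any data type.
Define the state update function - specify with a function how to update the state using the previous state and the new values from the input stream.

In every batch, Spark applies the state update function to all existing keys, regardless of whether they have new data in the batch. If the update function returns None, the key-value pair is removed.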
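
The two rules in the last paragraph (every existing key is visited in every batch, and returning `None` removes the key) can be sketched in plain Scala. This is a simplified model under stated assumptions, not Spark's implementation: the state is an ordinary `Map`, and `UpdateStateModel`, `step` and the expiry policy in `sumOrExpire` are hypothetical names chosen for illustration.

```scala
object UpdateStateModel {

  // One batch step: call `update` for every key that has either stored state
  // or new values, and drop the key whenever `update` returns None.
  def step[K, V, S](state: Map[K, S], batch: Seq[(K, V)])
                   (update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val newValues: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val keys = state.keySet ++ newValues.keySet
    keys.flatMap { k =>
      update(newValues.getOrElse(k, Seq.empty[V]), state.get(k)).map(s => (k, s))
    }.toMap
  }

  // Illustrative policy: keep a running sum, but forget a key as soon as
  // a batch arrives with no new values for it.
  val sumOrExpire: (Seq[Int], Option[Int]) => Option[Int] =
    (values, previous) =>
      if (values.isEmpty) None else Some(values.sum + previous.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val s1 = step(Map.empty[String, Int], Seq(("a", 1), ("b", 1), ("a", 1)))(sumOrExpire)
    val s2 = step(s1, Seq(("a", 1)))(sumOrExpire)
    println(s1) // state after the first batch
    println(s2) // "b" had no new data, so sumOrExpire returned None and it was removed
  }
}
```

An update function that always returns `Some(...)`, as in a plain running word count, never removes keys, so the state grows with the number of distinct keys seen.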

Rewriting wordCount with updateStateByKey

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

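object WordCount {

  def main(args: Array[String]): Unit = {

    // State update function: `values` holds the counts for a word in the current batch; `state` is its accumulated count from previous batches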
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    val conf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs://hadoop:9000/chk")

    val lines = ssc.socketTextStream("hadoop", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))

    val stateDstream = pairs.updateStateByKey[Int](updateFunc)
    stateDstream.print()

    //val wordCounts = pairs.reduceByKey(_ + _)
    //wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }

}

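mapWithState operation

It works much like updateStateByKey.
Compared with a plain map, it additionally maintains historical state for each key. Unlike updateStateByKey, the mapping function is invoked only for keys that receive data in the current batch (or whose state is timing out, when a timeout is configured), so only those keys appear in the output.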


import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object WordCountWithState {

  // Custom mapping function: accumulate the number of occurrences of the word and update the state
  val mappingFunc = (word: String, count: Option[Int], state: State[Int]) => {
    val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
    // Update the stored state first, then return the accumulated result
    state.update(sum)
    (word, sum)
  }

  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
    val ssc = new StreamingContext(sparkconf,Seconds(5))
    ssc.checkpoint("data/chk")

    val socketDStream = ssc.socketTextStream("localhost", 8888)
    val wordPair = socketDStream.flatMap(_.split("\\s+"))
        .map(word => (word,1))

    wordPair.mapWithState(StateSpec.function(mappingFunc))
      .foreachRDD { rdd =>
        rdd.foreach(println(_))
      }


    ssc.start()
    ssc.awaitTermination()
  }
}
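
To see how the output of this program differs from the updateStateByKey version, here is a simplified plain-Scala model of the mapping step. It is a sketch under assumptions, not Spark's implementation: the state lives in an ordinary mutable map, partitioning, timeouts and checkpointing are ignored, and `MapWithStateModel` and `processBatch` are hypothetical names.

```scala
import scala.collection.mutable

object MapWithStateModel {

  // Process one batch: the mapping logic runs once per incoming record,
  // reads and updates that key's state, and emits one output element.
  // Keys that are absent from the batch are neither visited nor emitted.
  def processBatch(state: mutable.Map[String, Int],
                   batch: Seq[(String, Int)]): Seq[(String, Int)] =
    batch.map { case (word, count) =>
      val sum = count + state.getOrElse(word, 0)
      state.update(word, sum)
      (word, sum)
    }

  def main(args: Array[String]): Unit = {
    val state = mutable.Map.empty[String, Int]
    println(processBatch(state, Seq(("a", 1), ("b", 1), ("a", 1))))
    println(processBatch(state, Seq(("b", 1)))) // "a" keeps its state but is not emitted
    println(state)
  }
}
```

In this model the second batch emits only an element for `b`, while the stored total for `a` is kept but not printed. In Spark itself, the stream returned by mapWithState also offers `stateSnapshots()` when the full state of all keys is needed.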