Spark Streaming updateStateByKey usage

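This article describes what the updateStateByKey function in Spark Streaming does and how to use it, including a look at its source code and typical use cases. Three concrete examples (StatefulNetworkWordCount, NetworkWordCount, WebPagePopularityValueCalculator) show how to use it to maintain and update state data.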


Source: http://blog.youkuaiyun.com/stark_summer/article/details/47666337

 

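  1. What updateStateByKey does:
    It aggregates the data in a DStream by key and carries the per-key result forward, accumulating it across batches.
    As new data arrives or existing data is updated, it lets you maintain whatever state you need. Using it takes two steps:
    1) Define the state: the state can be of any data type.
    2) Define the state update function: a function that specifies how to update the state from the previous state and the new values from the input stream.
    A stateful operation keeps combining the RDDs of the current batch interval with those of earlier intervals, so the amount of data involved in the computation keeps growing over time.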
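
The two steps above can be sketched in plain Scala, without Spark, to make the per-batch behaviour concrete. This is only a simulation under stated assumptions: `RunningCountSketch` and `applyBatch` are hypothetical names, not Spark APIs, and the real operator works on RDDs rather than on in-memory maps. The sketch does mirror one property of updateStateByKey: the update function is called for every key that already has state, even when the current batch has no new values for it.

```scala
object RunningCountSketch {
  // Step 1: the state is an Int (the running count of a word).
  // Step 2: the update function merges this batch's values into the previous state.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => Some(values.sum + state.getOrElse(0))

  // Hypothetical helper (not a Spark API): applies updateFunc once per key,
  // covering keys already in the state as well as keys seen in this batch.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(v => k -> v)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batch1 = Seq("hello" -> 1, "world" -> 1, "hello" -> 1)
    val batch2 = Seq("hello" -> 1, "spark" -> 1)
    val afterBatch1 = applyBatch(Map.empty, batch1)
    val afterBatch2 = applyBatch(afterBatch1, batch2)
    println(afterBatch1)
    println(afterBatch2)
  }
}
```

After the second batch, "hello" has a count of 3, while "world" keeps its count of 1 even though it did not appear in that batch.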

  2. updateStateByKey source code:

    /**
     * Return a new "state" DStream where the state for each key is updated by applying
     * the given function on the previous state of the key and the new values of the key.
     * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
     * @param updateFunc State update function. If this function returns None, then
     *                   corresponding state key-value pair will be eliminated.
     * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
     *                    DStream.
     * @param initialRDD initial state value of each key.
     * @tparam S State type
     */
    def updateStateByKey[S: ClassTag](
        updateFunc: (Seq[V], Option[S]) => Option[S],
        partitioner: Partitioner,
        initialRDD: RDD[(K, S)]
      ): DStream[(K, S)] = {
      val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
        iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
      }
      updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
    }
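
The wrapper in the source above turns a per-key update function into a function over an iterator of (key, new values, previous state) triples, and a key whose update returns None drops out of the result. A minimal plain-Scala sketch of that adaptation follows, with the type parameters fixed to K = String, V = Int, S = Int. The names `UpdateFuncAdapterSketch`, `adapt` and `sumOrDrop` are made up for illustration and are not part of Spark.

```scala
object UpdateFuncAdapterSketch {
  type Update = (Seq[Int], Option[Int]) => Option[Int]
  type Row = (String, Seq[Int], Option[Int])

  // Same shape as newUpdateFunc in the source above: flatMap discards the keys
  // for which updateFunc returns None.
  def adapt(updateFunc: Update): Iterator[Row] => Iterator[(String, Int)] =
    iterator => iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))

  // Hypothetical update function: keep a running sum, and drop the key
  // once the sum is no longer positive.
  val sumOrDrop: Update = (values, state) => {
    val total = values.sum + state.getOrElse(0)
    if (total > 0) Some(total) else None
  }

  def main(args: Array[String]): Unit = {
    val rows: Iterator[Row] = Iterator(
      ("a", Seq(1, 2), Some(3)), // 1 + 2 + 3 = 6, kept
      ("b", Seq(-1), Some(1)),   // -1 + 1 = 0, dropped
      ("c", Seq(5), None)        // new key, 5, kept
    )
    println(adapt(sumOrDrop)(rows).toList) // List((a,6), (c,5))
  }
}
```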
  3. Code implementation

    • StatefulNetworkWordCount

      <code class="hljs coffeescript has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">import org.apache.log4j.{Level, Logger}
      import org.apache.spark.{HashPartitioner, SparkConf}
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object StatefulNetworkWordCount {
        def main(args: Array[String]): Unit = {
          if (args.length < 2) {
            System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
            System.exit(1)
          }

          // Reduce Spark's own log output so the printed counts are easy to see
          Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

          // Add this batch's occurrences of a word to its previous running count
          val updateFunc = (values: Seq[Int], state: Option[Int]) => {
            val currentCount = values.sum
            val previousCount = state.getOrElse(0)
            Some(currentCount + previousCount)
          }

          // Adapt updateFunc to the iterator-based signature of the overload used below
          val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
            iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
          }

          // local[2]: the socket receiver occupies one thread, so at least one more is needed for processing
          val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[2]")
          // Create the context with a 1 second batch size
          val ssc = new StreamingContext(sparkConf, Seconds(1))
          // updateStateByKey requires a checkpoint directory
          ssc.checkpoint(".")

          // Initial RDD input to updateStateByKey
          val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

          // Create a ReceiverInputDStream on target ip:port and count the
          // words in input stream of \n delimited text (e.g. generated by 'nc')
          val lines = ssc.socketTextStream(args(0), args(1).toInt)
          val words = lines.flatMap(_.split(" "))
          val wordDstream = words.map(x => (x, 1))

          // Update the cumulative count using updateStateByKey
          // This will give a DStream made of state (which is the cumulative count of the words)
          val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
            new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
          stateDstream.print()
          ssc.start()
          ssc.awaitTermination()
        }
      }</code>
    • NetworkWordCount

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Brings in the pair-DStream implicits; likely only needed on older Spark releases
import org.apache.spark.streaming.StreamingContext._

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // A checkpoint directory must be set before using updateStateByKey
    ssc.checkpoint("hdfs://master:8020/spark/checkpoint")

    val addFunc = (currValues: Seq[Int], prevValueState: Option[Int]) => {
      // Spark groups the current batch's values by key and passes the values of one key
      // here as a Seq; summing them gives that key's total for the current batch
      val currentCount = currValues.sum
      // The value accumulated so far
      val previousCount = prevValueState.getOrElse(0)
      // Return the new accumulated value, as an Option[Int]
      Some(currentCount + previousCount)
    }

    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))

    // Per-batch counts only, for comparison:
    //val currWordCounts = pairs.reduceByKey(_ + _)
    //currWordCounts.print()

    val totalWordCounts = pairs.updateStateByKey[Int](addFunc)
    totalWordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}</code>
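
Because the update function runs for every key that already has state, the state in NetworkWordCount only ever grows. One way to bound it relies on the rule from the source comment that returning None removes the key. The sketch below is an assumption-laden variant of addFunc, not something from the original example: the state becomes a pair (count, idle batches), and `ExpiringCountSketch`, `addOrExpire` and the threshold `maxIdleBatches` are hypothetical names chosen for illustration.

```scala
object ExpiringCountSketch {
  // State: (running count, number of consecutive batches without new values)
  type State = (Int, Int)

  // Assumed threshold, chosen only for illustration
  val maxIdleBatches = 3

  val addOrExpire: (Seq[Int], Option[State]) => Option[State] = (currValues, prev) => {
    val (count, idle) = prev.getOrElse((0, 0))
    if (currValues.nonEmpty) Some((count + currValues.sum, 0)) // new data: add and reset the idle counter
    else if (idle + 1 >= maxIdleBatches) None                  // idle for too long: drop the key
    else Some((count, idle + 1))                               // no new data: keep the count, age the key
  }

  def main(args: Array[String]): Unit = {
    println(addOrExpire(Seq(1, 1), None))         // Some((2,0))
    println(addOrExpire(Seq.empty, Some((2, 0)))) // Some((2,1))
    println(addOrExpire(Seq.empty, Some((2, 2)))) // None
  }
}
```

In the streaming job, such a function would presumably be passed as `pairs.updateStateByKey[(Int, Int)](addOrExpire)`, since the state type parameter has to match the pair.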
  • WebPagePopularityValueCalculator
<code class="hljs scala has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">package com.spark.streaming

import org.apache.spark.{HashPartitioner, SparkConf}
// KafkaUtils.createStream is the receiver-based API of the spark-streaming-kafka connector
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

/**
 * ━━━━━━ Mythical beast sighted ━━━━━━
 *    ┏┓   ┏┓
 *   ┏┛┻━━━┛┻┓
 *   ┃       ┃
 *   ┃   ━   ┃
 *   ┃ ┳┛ ┗┳ ┃
 *   ┃       ┃
 *   ┃   ┻   ┃
 *   ┃       ┃
 *   ┗━┓   ┏━┛
 *     ┃   ┃ Blessed by the mythical beast, never a bug!
 *     ┃   ┃ Code is far away from bug with the animal protecting
 *     ┃   ┗━━━┓
 *     ┃       ┣┓
 *     ┃       ┏┛
 *     ┗┓┓┏━┳┓┏┛
 *      ┃┫┫ ┃┫┫
 *      ┗┻┛ ┗┻┛
 * ━━━━━━ So cute ━━━━━━
 * Module Desc:
 * User: wangyue
 * DateTime: 15-11-9 10:50 AM
 */
object WebPagePopularityValueCalculator {

  private val checkpointDir = "popularity-data-checkpoint"
  private val msgConsumerGroup = "user-behavior-topic-message-consumer-group"

  def main(args: Array[String]): Unit = {

    if (args.length < 2) {
      println("Usage: WebPagePopularityValueCalculator zkserver1:2181,zkserver2:2181,zkserver3:2181 consumeMsgDataTimeInterval(secs)")
      System.exit(1)
    }

    val Array(zkServers, processingInterval) = args
    val conf = new SparkConf().setAppName("Web Page Popularity Value Calculator")

    val ssc = new StreamingContext(conf, Seconds(processingInterval.toInt))
    // using updateStateByKey requires checkpointing to be enabled
    ssc.checkpoint(checkpointDir)

    val kafkaStream = KafkaUtils.createStream(
      // Spark streaming context
      ssc,
      // zookeeper quorum, e.g. zkserver1:2181,zkserver2:2181,...
      zkServers,
      // kafka message consumer group ID
      msgConsumerGroup,
      // Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
      Map("user-behavior-topic" -> 3))
    // keep only the message value of each (key, value) record
    val msgDataRDD = kafkaStream.map(_._2)

    // for debug use only
    //println("Coming data in this interval...")
    //msgDataRDD.print()
    // e.g. page37|5|1.5119122|-1
    val popularityData = msgDataRDD.map { msgLine => {
      val dataArr: Array[String] = msgLine.split("\\|")
      val pageID = dataArr(0)
      // calculate the popularity value as a weighted sum of the three numeric fields
      val popValue: Double = dataArr(1).toFloat * 0.8 + dataArr(2).toFloat * 0.8 + dataArr(3).toFloat * 1
      (pageID, popValue)
    }
    }

    // sum the previous popularity value and current value:
    // an anonymous function that adds the newly computed values of a page to its
    // previously computed popularity value, giving the latest popularity value
    val updatePopularityValue = (iterator: Iterator[(String, Seq[Double], Option[Double])]) => {
      iterator.flatMap(t => {
        val newValue: Double = t._2.sum
        val stateValue: Double = t._3.getOrElse(0.0)
        Some(newValue + stateValue)
      }.map(summedValue => (t._1, summedValue)))
    }

    val initialRDD = ssc.sparkContext.parallelize(List(("page1", 0.00)))

    // call updateStateByKey with the anonymous function defined above to update the popularity values
    val stateDStream = popularityData.updateStateByKey[Double](updatePopularityValue,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)

    // set the checkpoint interval to avoid checkpointing data too frequently, which
    // may significantly reduce operation throughput
    stateDStream.checkpoint(Duration(8 * processingInterval.toInt * 1000))

    // after calculation, sort the result and only show the top 10 hot pages
    stateDStream.foreachRDD { rdd => {
      val sortedData = rdd.map { case (k, v) => (v, k) }.sortByKey(false)
      val topKData = sortedData.take(10).map { case (v, k) => (k, v) }
      topKData.foreach(x => {
        println(x)
      })
    }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}</code>

References:
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/ 
https://github.com/apache/spark/blob/branch-1.3/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala 
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
http://stackoverflow.com/questions/28998408/spark-streaming-example-calls-updatestatebykey-with-additional-parameters 
http://stackoverflow.com/questions/27535668/spark-streaming-groupbykey-and-updatestatebykey-implementation

