Source: http://blog.youkuaiyun.com/stark_summer/article/details/47666337
-
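updateStateByKey explained:
For each key in the DStream, it reduces the values that arrive in the current batch and then accumulates the result across batches.
It lets you maintain whatever state you want while that state is continuously updated with new or changed data. Using it takes two steps:
1) Define the state: the state can be of any data type.
2) Define the state update function: a function that specifies how to update the state from the previous state and the new values from the input stream.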
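The two steps above can be sketched in plain Scala, without a Spark cluster. This is only an illustration: the state is assumed to be an `Int` running count, and the names `UpdateFunctionSketch` and `updateCount` are made up for this example. The function has the same shape that `updateStateByKey` expects, `(Seq[V], Option[S]) => Option[S]`.

```scala
// Illustrative sketch only: UpdateFunctionSketch and updateCount are
// hypothetical names, not part of Spark.
object UpdateFunctionSketch {
  // Step 1: the state is an Int (the running count for one key).
  // Step 2: the update function combines the new values for a key
  // with the previous state, if there is one.
  val updateCount: (Seq[Int], Option[Int]) => Option[Int] =
    (newValues, previousState) => Some(newValues.sum + previousState.getOrElse(0))

  def main(args: Array[String]): Unit = {
    // First batch: the key has no previous state yet.
    println(updateCount(Seq(1, 1, 1), None))
    // Later batch: two new values are added to a stored count of 3.
    println(updateCount(Seq(1, 1), Some(3)))
  }
}
```

Because the previous state arrives as an `Option`, the same function covers both a key seen for the first time and a key that already has a count.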
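A stateful operation has to keep combining the RDD of the current time slice with the RDDs of past slices. As time passes, the amount of data involved in the computation keeps growing, because the state holds an entry for every key seen so far unless the update function returns None for it.
-
updateStateByKey source code: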
/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of the key.
 * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream.
 * @param initialRDD initial state value of each key.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = {
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
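To see what `newUpdateFunc` does, the same wrapping can be tried on an ordinary Scala `Iterator`. The sketch below is not Spark code: `LiftUpdateSketch`, `liftUpdate`, and `dropWhenIdle` are hypothetical names, and the rule "drop a key that received no new values" is an assumption chosen only to show how `flatMap` removes keys whose update returns `None`.

```scala
// Illustrative sketch only; liftUpdate and dropWhenIdle are hypothetical names.
object LiftUpdateSketch {
  // Turns a per-key update function into a function over an iterator of
  // (key, new values, previous state), mirroring newUpdateFunc above.
  def liftUpdate[K, V, S](
      updateFunc: (Seq[V], Option[S]) => Option[S]
  ): Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)] =
    iterator => iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))

  // Assumed rule for this example: forget a key when a batch brings
  // no new values for it; otherwise add the new values to the count.
  val dropWhenIdle: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) =>
      if (values.isEmpty) None else Some(values.sum + state.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val batch: Iterator[(String, Seq[Int], Option[Int])] = Iterator(
      ("hello", Seq(1, 1), Some(3)),      // existing key with new values
      ("world", Seq.empty[Int], Some(2)), // existing key, nothing new: dropped
      ("spark", Seq(1), None)             // key seen for the first time
    )
    liftUpdate[String, Int, Int](dropWhenIdle)(batch).foreach(println)
  }
}
```

Returning `None` for some keys is one way to keep the state from growing without bound, at the cost of losing the history of those keys.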
-
Code implementation
-
StatefulNetworkWordCount
```scala
// Logger/Level come from log4j 1.x, which Spark 1.x bundles.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulNetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

    // Per-key update: add this batch's values to the previous count.
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    // Iterator form required by the updateStateByKey overload used below.
    val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
    }

    // The socket receiver occupies one thread, so local mode needs at least
    // two ("local[2]"); with plain "local" no batches would be processed.
    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[2]")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(".")

    // Initial RDD input to updateStateByKey
    val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

    // Create a ReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    // Update the cumulative count using updateStateByKey
    // This will give a Dstream made of state (which is the cumulative count of the words)
    val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```
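As a rough mental model of what this job computes, the sketch below replays a few batches with plain Scala collections. It is a simplification under stated assumptions, not how Spark executes the job: `StateReplaySketch`, `step`, and the sample batches are invented here, and the initial state mirrors the `initialRDD` of `("hello", 1), ("world", 1)`.

```scala
// Illustrative sketch only; StateReplaySketch and step are hypothetical names.
object StateReplaySketch {
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => Some(values.sum + state.getOrElse(0))

  // Applies one batch of (word, 1) pairs to the current state.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Every key seen so far is updated, even if the batch has no values for it.
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val initial = Map("hello" -> 1, "world" -> 1)
    val batches = Seq(
      Seq("hello" -> 1, "spark" -> 1),
      Seq("hello" -> 1, "hello" -> 1)
    )
    // scanLeft keeps the initial state plus one state per batch.
    batches.scanLeft(initial)(step).foreach(println)
  }
}
```

Note that "world" keeps its count of 1 through both batches: a key stays in the state as long as the update function returns `Some` for it, even when no new values arrive.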
-
NetworkWordCount
-
```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Needed on Spark versions before 1.3 for the pair DStream operations.
import org.apache.spark.streaming.StreamingContext._

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // A checkpoint directory must be set before using updateStateByKey
    ssc.checkpoint("hdfs://master:8020/spark/checkpoint")

    val addFunc = (currValues: Seq[Int], prevValueState: Option[Int]) => {
      // Spark groups the current batch's values by key and passes them in
      // as a Seq; sum them to get this batch's total for the key
      val currentCount = currValues.sum
      // The value accumulated so far
      val previousCount = prevValueState.getOrElse(0)
      // Return the new running total as an Option[Int]
      Some(currentCount + previousCount)
    }

    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))

    // Per-batch counts only (no state), kept for comparison:
    //val currWordCounts = pairs.reduceByKey(_ + _)
    //currWordCounts.print()

    val totalWordCounts = pairs.updateStateByKey[Int](addFunc)
    totalWordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```
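The commented-out `reduceByKey` lines and the `updateStateByKey` call answer different questions: the first counts words within a single batch, the second keeps totals across batches. A small sketch with plain collections makes the contrast concrete. `BatchVsTotalSketch`, `batchCounts`, `addToTotals`, and the sample lines are assumptions made for this illustration.

```scala
// Illustrative sketch only; all names here are hypothetical.
object BatchVsTotalSketch {
  // Word counts for one batch only, analogous to pairs.reduceByKey(_ + _).
  def batchCounts(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Running totals: merge this batch's counts into the totals so far,
  // analogous to what addFunc does per key in updateStateByKey.
  def addToTotals(totals: Map[String, Int], lines: Seq[String]): Map[String, Int] =
    batchCounts(lines).foldLeft(totals) { case (acc, (w, n)) =>
      acc.updated(w, acc.getOrElse(w, 0) + n)
    }

  def main(args: Array[String]): Unit = {
    val batch1 = Seq("hello world", "hello spark")
    val batch2 = Seq("hello again")
    // Counts for the second batch alone.
    println(batchCounts(batch2))
    // Totals over both batches.
    println(addToTotals(addToTotals(Map.empty, batch1), batch2))
  }
}
```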
- WebPagePopularityValueCalculator
```scala
package com.spark.streaming

import org.apache.spark.{HashPartitioner, SparkConf}
// Receiver-based Kafka connector (spark-streaming-kafka, Kafka 0.8 API)
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

/**
 * Module Desc:
 * User: wangyue
 * DateTime: 15-11-9 10:50 AM
 */
object WebPagePopularityValueCalculator {

  private val checkpointDir = "popularity-data-checkpoint"
  private val msgConsumerGroup = "user-behavior-topic-message-consumer-group"

  def main(args: Array[String]): Unit = {
    // Exactly two arguments are required by the pattern match below.
    if (args.length != 2) {
      println("Usage: WebPagePopularityValueCalculator " +
        "zkserver1:2181,zkserver2:2181,zkserver3:2181 consumeMsgDataTimeInterval(secs)")
      System.exit(1)
    }

    val Array(zkServers, processingInterval) = args
    val conf = new SparkConf().setAppName("Web Page Popularity Value Calculator")
    val ssc = new StreamingContext(conf, Seconds(processingInterval.toInt))
    // updateStateByKey requires checkpointing to be enabled
    ssc.checkpoint(checkpointDir)

    val kafkaStream = KafkaUtils.createStream(
      // Spark streaming context
      ssc,
      // zookeeper quorum, e.g. zkserver1:2181,zkserver2:2181,...
      zkServers,
      // kafka message consumer group ID
      msgConsumerGroup,
      // Map of (topic_name -> numPartitions) to consume.
      // Each partition is consumed in its own thread
      Map("user-behavior-topic" -> 3))
    val msgDataRDD = kafkaStream.map(_._2)

    // for debug use only
    //println("Coming data in this interval...")
    //msgDataRDD.print()

    // e.g. page37|5|1.5119122|-1
    val popularityData = msgDataRDD.map { msgLine =>
      val dataArr: Array[String] = msgLine.split("\\|")
      val pageID = dataArr(0)
      // calculate the popularity value
      val popValue: Double =
        dataArr(1).toFloat * 0.8 + dataArr(2).toFloat * 0.8 + dataArr(3).toFloat * 1
      (pageID, popValue)
    }

    // Add the newly computed popularity values to the previous result
    // to get the latest popularity value of each page.
    val updatePopularityValue = (iterator: Iterator[(String, Seq[Double], Option[Double])]) => {
      iterator.flatMap { t =>
        val newValue: Double = t._2.sum
        val stateValue: Double = t._3.getOrElse(0.0)
        Some(newValue + stateValue).map(summedValue => (t._1, summedValue))
      }
    }

    val initialRDD = ssc.sparkContext.parallelize(List(("page1", 0.00)))

    // Call updateStateByKey with the function defined above to update the popularity values.
    val stateDStream = popularityData.updateStateByKey[Double](updatePopularityValue,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)

    // set the checkpoint interval to avoid checkpointing too frequently,
    // which may significantly reduce operation throughput
    stateDStream.checkpoint(Duration(8 * processingInterval.toInt * 1000))

    // after the calculation, sort the result and print only the 10 pages
    // with the highest popularity values
    stateDStream.foreachRDD { rdd =>
      val sortedData = rdd.map { case (k, v) => (v, k) }.sortByKey(false)
      val topKData = sortedData.take(10).map { case (v, k) => (k, v) }
      topKData.foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```
References:
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/
https://github.com/apache/spark/blob/branch-1.3/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
http://stackoverflow.com/questions/28998408/spark-streaming-example-calls-updatestatebykey-with-additional-parameters
http://stackoverflow.com/questions/27535668/spark-streaming-groupbykey-and-updatestatebykey-implementation