Background:
Spark Streaming divides all work into jobs by Batch Duration. Very often, however, we need to compute over the past day or even the past week of data, and then state management becomes unavoidable: every Batch Duration produces a new job, and each job only sees the RDDs of its own batch. The question is therefore how to maintain state across batches. This is where updateStateByKey and mapWithState come in.
Source code analysis:
1. Neither updateStateByKey nor mapWithState is defined on DStream itself; both become available through an implicit conversion.
object DStream {
  // `toPairDStreamFunctions` was in SparkContext before 1.3 and users had to
  // `import StreamingContext._` to enable it. Now we move it here to make the compiler find
  // it automatically. However, we still keep the old function in StreamingContext for backward
  // compatibility and forward to the following function directly.
  implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
      (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):
    PairDStreamFunctions[K, V] = {
    new PairDStreamFunctions[K, V](stream)
  }
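Because this implicit lives in DStream's companion object, any DStream of key-value pairs picks up the PairDStreamFunctions API automatically. A minimal sketch, assuming words is an existing DStream[String]:
// words is an assumed DStream[String]; mapping to pairs yields a DStream[(String, Int)]
val pairs: DStream[(String, Int)] = words.map(word => (word, 1))

// The compiler applies toPairDStreamFunctions automatically, so pair-only methods
// such as reduceByKey and updateStateByKey resolve without `import StreamingContext._`
val counts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)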
updateStateByKey
1. The concrete implementation of updateStateByKey lives in PairDStreamFunctions, as shown below.
updateStateByKey applies updateFunc to the new values and the previously stored state of each key, producing the updated state; the result is again a DStream. A minimal usage sketch follows the signature below.
/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
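For example, a running count per key can be kept with an updateFunc like the one below. This is only an illustrative sketch continuing the earlier example; ssc is an existing StreamingContext, and the checkpoint path is a placeholder (a checkpoint directory is required for stateful transformations):
// Checkpointing must be enabled for stateful transformations; the path is a placeholder
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

// newValues: all values for this key in the current batch; runningCount: previous state
def updateFunc(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(newValues.sum + runningCount.getOrElse(0))  // return None instead to drop the key
}

val stateCounts: DStream[(String, Int)] = pairs.updateStateByKey[Int](updateFunc _)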
2. defaultPartitioner
private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism) = {
  new HashPartitioner(numPartitions)
}
3. The partitioner controls how each generated RDD is partitioned; this overload takes one explicitly (a usage sketch follows the code below).
/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of the key.
 * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}
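To override the default parallelism, a Partitioner can be passed explicitly. A minimal sketch, reusing the updateFunc from above (the partition count 20 is an arbitrary example):
import org.apache.spark.HashPartitioner

// Same update function as above, but with an explicit partitioner
val stateCountsPartitioned = pairs.updateStateByKey[Int](updateFunc _, new HashPartitioner(20))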
4. The overload above forwards to the iterator-based variant with rememberPartitioner set to true.
/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
 * @param updateFunc State update function. Note, that this function may generate a different
 *                   tuple with a different key than the input key. Therefore keys may be removed
 *                   or added in this way. It is up to the developer to decide whether to
 *                   remember the partitioner despite the key being changed.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream
 * @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}
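This iterator-based overload can also be called directly when you want to process a whole partition of (key, newValues, previousState) tuples at once. A rough sketch under the same assumptions as the earlier examples (running count per key):
import org.apache.spark.HashPartitioner

// Process a whole partition of (key, newValues, previousState) tuples at once
val perPartitionUpdate = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
  it.map { case (key, newValues, prevState) =>
    (key, newValues.sum + prevState.getOrElse(0))
  }
}

val stateCountsIter = pairs.updateStateByKey[Int](
  perPartitionUpdate, new HashPartitioner(20), rememberPartitioner = true)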
5. StateDStream persists its RDDs with StorageLevel.MEMORY_ONLY_SER: because the accumulated state can be very large, it is kept in serialized form to reduce memory pressure.
private[streaming]
class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag](
    parent: DStream[(K, V)],
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    preservePartitioning: Boolean,
    initialRDD: Option[RDD[(K, S)]]
  ) extends DStream[(K, S)](parent.ssc) {

  super.persist(StorageLevel.MEMORY_ONLY_SER)
The source of computeUsingPreviousRDD is as follows:
private[this] def computeUsingPreviousRDD(
    parentRDD: RDD[(K, V)], prevStateRDD: RDD[(K, S)]) = {
  // Define the function for the mapPartition operation on cogrouped RDD;
  // first map the cogrouped tuple to tuples of required type,
  // and then apply the update function
  val updateFuncLocal = updateFunc
  val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
    val i = iterator.map(t => {
      val itr = t._2._2.iterator
      val headOption = if (itr.hasNext) Some(itr.next()) else None
      (t._1, t._2._1.toSeq, headOption)
    })
    updateFuncLocal(i)
  }
  // cogroup traverses every partition of prevStateRDD on every batch computation
  val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
  val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
  Some(stateRDD)
}
Therefore, updateStateByKey is not recommended when the volume of state data is large.
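To make the cost concrete, here is a rough RDD-level sketch of what one batch effectively does; sc is the underlying SparkContext (e.g. ssc.sparkContext), and the real code path is the computeUsingPreviousRDD shown above:
// Previous state: one entry per key ever seen; it grows without bound
val prevState = sc.parallelize(Seq(("a", 10), ("b", 3)))
// Data arriving in the current batch
val batch = sc.parallelize(Seq(("a", 1), ("a", 1), ("c", 1)))

// cogroup touches every partition of prevState, even for keys with no new data
val newState = batch.cogroup(prevState).mapValues { case (newValues, oldState) =>
  newValues.sum + oldState.headOption.getOrElse(0)
}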
That completes the walk-through of updateStateByKey.
mapWithState:
1. mapWithState returns a MapWithStateDStream. State is maintained and updated per key, using a user-supplied function over key-value data.
/**
 * :: Experimental ::
 * Return a [[MapWithStateDStream]] by applying a function to every key-value element of
 * `this` stream, while maintaining some state data for each unique key. The mapping function
 * and other specification (e.g. partitioners, timeouts, initial state data, etc.) of this
 * transformation can be specified using [[StateSpec]] class. The state data is accessible
 * as a parameter of type [[State]] in the mapping function.
 *
 * Example of using `mapWithState`:
 * {{{
 *    // A mapping function that maintains an integer state and return a String
 *    // The state here can be thought of as a table that records all historical state
 *    // maintained so far.
 *    def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
 *      // Use state.exists(), state.get(), state.update() and state.remove()
 *      // to manage state, and return the necessary string
 *    }
 *
 *    val spec = StateSpec.function(mappingFunction).numPartitions(10)
 *
 *    val mapWithStateDStream = keyValueDStream.mapWithState[StateType, MappedType](spec)
 * }}}
 *
 * @param spec          Specification of this transformation
 * @tparam StateType    Class type of the state data
 * @tparam MappedType   Class type of the mapped data
 */
@Experimental
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType] = {
  new MapWithStateDStreamImpl[K, V, StateType, MappedType](
    self,
    // StateSpecImpl wraps the user-supplied StateSpec configuration.
    spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
  )
}
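A minimal end-to-end sketch of using mapWithState, continuing the earlier examples (the mapping function keeps a running count per key; names like trackCount are illustrative):
import org.apache.spark.streaming.{State, StateSpec}

// Called once per (key, value) in each batch; the State handle carries the per-key state
def trackCount(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val newCount = state.getOption().getOrElse(0) + value.getOrElse(0)
  state.update(newCount)
  (key, newCount)
}

val spec = StateSpec.function(trackCount _)
val mapped = pairs.mapWithState(spec)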
2. The source of MapWithStateDStream is as follows:
/**
 * :: Experimental ::
 * DStream representing the stream of data generated by `mapWithState` operation on a
 * [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair DStream]].
 * Additionally, it also gives access to the stream of state snapshots, that is, the state data of
 * all keys after a batch has updated them.
 *
 * @tparam KeyType    Class of the key
 * @tparam ValueType  Class of the value
 * @tparam StateType  Class of the state data
 * @tparam MappedType Class of the mapped data
 */
@Experimental
sealed abstract class MapWithStateDStream[KeyType, ValueType, StateType, MappedType: ClassTag](
    ssc: StreamingContext) extends DStream[MappedType](ssc) {

  /** Return a pair DStream where each RDD is the snapshot of the state of all the keys. */
  def stateSnapshots(): DStream[(KeyType, StateType)]
}
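stateSnapshots() is what distinguishes this DStream: besides the mapped output, you can also obtain a snapshot of the complete state after every batch. Continuing the sketch above:
// One (key, state) pair per tracked key, emitted after every batch updates the state
val snapshots: DStream[(String, Int)] = mapped.stateSnapshots()
snapshots.print()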
/** Internal implementation of the [[MapWithStateDStream]] */
private[streaming] class MapWithStateDStreamImpl[
    KeyType: ClassTag, ValueType: ClassTag, StateType: ClassTag, MappedType: ClassTag](
    dataStream: DStream[(KeyType, ValueType)],
    spec: StateSpecImpl[KeyType, ValueType, StateType, MappedType])
  extends MapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream.context) {

  private val internalStream =
    new InternalMapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream, spec)

  override def slideDuration: Duration = internalStream.slideDuration

  override def dependencies: List[DStream[_]] = List(internalStream)

  // The actual computation is delegated to InternalMapWithStateDStream
  override def compute(validTime: Time): Option[RDD[MappedType]] = {
    internalStream.getOrCompute(validTime).map { _.flatMap[MappedType] { _.mappedData } }
  }
StateSpecImpl
/** Internal implementation of [[org.apache.spark.streaming.StateSpec]] interface. */
private[streaming]
case class StateSpecImpl[K, V, S, T](
    function: (Time, K, Option[V], State[S]) => Option[T]) extends StateSpec[K, V, S, T] {

  require(function != null)

  @volatile private var partitioner: Partitioner = null
  @volatile private var initialStateRDD: RDD[(K, S)] = null
  @volatile private var timeoutInterval: Duration = null

  override def initialState(rdd: RDD[(K, S)]): this.type = {
    this.initialStateRDD = rdd
    this
  }

  override def initialState(javaPairRDD: JavaPairRDD[K, S]): this.type = {
    this.initialStateRDD = javaPairRDD.rdd
    this
  }
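These setters are what the fluent StateSpec API fills in. A sketch of configuring a spec with an initial state, an explicit partition count, and a timeout, reusing trackCount from the earlier example (the seed values are arbitrary):
import org.apache.spark.streaming.Minutes

// Seed state for some keys before the first batch (example values)
val initialState = ssc.sparkContext.parallelize(Seq(("a", 100), ("b", 32)))

val specWithOptions = StateSpec.function(trackCount _)
  .initialState(initialState)   // stored in the initialStateRDD field above
  .numPartitions(10)            // ends up as a HashPartitioner in the partitioner field
  .timeout(Minutes(30))         // stored in timeoutInterval; enables key expiry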
3. Updating the historical state: InternalMapWithStateDStream.
/**
 * A DStream that allows per-key state to be maintained, and arbitrary records to be generated
 * based on updates to the state. This is the main DStream that implements the `mapWithState`
 * operation on DStreams.
 *
 * @param parent Parent (key, value) stream that is the source
 * @param spec Specifications of the mapWithState operation
 * @tparam K Key type
 * @tparam V Value type
 * @tparam S Type of the state maintained
 * @tparam E Type of the mapped data
 */
private[streaming]
class InternalMapWithStateDStream[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    parent: DStream[(K, V)], spec: StateSpecImpl[K, V, S, E])
  extends DStream[MapWithStateRDDRecord[K, S, E]](parent.context) {

  // The state is kept as an in-memory data structure that is updated incrementally.
  persist(StorageLevel.MEMORY_ONLY)

  private val partitioner = spec.getPartitioner().getOrElse(
    new HashPartitioner(ssc.sc.defaultParallelism))

  private val mappingFunction = spec.getFunction()

  override def slideDuration: Duration = parent.slideDuration

  override def dependencies: List[DStream[_]] = List(parent)

  /** Enable automatic checkpointing */
  override val mustCheckpoint = true

  /** Override the default checkpoint duration */
  override def initialize(time: Time): Unit = {
    if (checkpointDuration == null) {
      checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
    }
    super.initialize(time)
  }
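Because mustCheckpoint is true, a checkpoint directory must be set before the context starts, otherwise mapWithState fails at start-up; the checkpoint interval of the resulting stream can also be tuned. A hedged sketch (the path and interval are placeholders):
import org.apache.spark.streaming.Minutes

// A checkpoint directory is mandatory before ssc.start(); the path is a placeholder
ssc.checkpoint("hdfs:///tmp/mapwithstate-checkpoint")

// Optionally checkpoint less often than the default multiple of the batch interval
mapped.checkpoint(Minutes(10))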
4. InternalMapWithStateDStream.compute
/** Method that generates a RDD for the given time */
override def compute(validTime: Time): Option[RDD[MapWithStateRDDRecord[K, S, E]]] = {
  // Get the previous state or create a new empty state RDD
  val prevStateRDD = getOrCompute(validTime - slideDuration) match {
    case Some(rdd) =>
      if (rdd.partitioner != Some(partitioner)) {
        // If the RDD is not partitioned the right way, let us repartition it using the
        // partition index as the key. This is to ensure that state RDD is always partitioned
        // before creating another state RDD using it
        MapWithStateRDD.createFromRDD[K, V, S, E](
          rdd.flatMap { _.stateMap.getAll() }, partitioner, validTime)
      } else {
        rdd
      }
    case None =>
      MapWithStateRDD.createFromPairRDD[K, V, S, E](
        spec.getInitialStateRDD().getOrElse(new EmptyRDD[(K, S)](ssc.sparkContext)),
        partitioner,
        validTime
      )
  }

  // Compute the new state RDD with previous state RDD and partitioned data RDD
  // Even if there is no data RDD, use an empty one to create a new state RDD
  // Get (or create an empty) data RDD for this batch interval
  val dataRDD = parent.getOrCompute(validTime).getOrElse {
    context.sparkContext.emptyRDD[(K, V)]
  }
  val partitionedDataRDD = dataRDD.partitionBy(partitioner)
  val timeoutThresholdTime = spec.getTimeoutInterval().map { interval =>
    (validTime - interval).milliseconds
  }
  Some(new MapWithStateRDD(
    prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime))
}
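The timeoutThresholdTime computed above is what drives key expiry: if a timeout was configured on the StateSpec, keys whose state has not been updated since that threshold are passed to the mapping function one last time with isTimingOut set, and their state is then removed. A sketch of handling this in the mapping function (names follow the earlier examples):
import org.apache.spark.streaming.{Minutes, State, StateSpec}

def trackCountWithTimeout(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  if (state.isTimingOut()) {
    // The key has expired; its state is removed after this call, so only read it here
    (key, state.getOption().getOrElse(0))
  } else {
    val newCount = state.getOption().getOrElse(0) + value.getOrElse(0)
    state.update(newCount)
    (key, newCount)
  }
}

val specWithTimeout = StateSpec.function(trackCountWithTimeout _).timeout(Minutes(30))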