1 Spark Streaming Basic Concepts
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from sources such as Kafka, Flume, Twitter, ZeroMQ, and Kinesis, and processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards.
2 How Spark Streaming Works
Spark Streaming splits the received data stream into blocks (batches); each block is processed by a Spark job, and the final results are likewise returned as a series of blocks.
3 The Spark Streaming Processing Model
Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams received from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations to other DStreams. Internally, a DStream is represented as a sequence of RDDs.
Spark Streaming programs can be written in Scala, Java, or Python.
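To make the RDD-per-batch structure concrete, here is a minimal sketch (it assumes a `DStream[String]` named `lines` already exists, for example from a socket source); `foreachRDD` exposes the RDD behind each batch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// `lines` is assumed to be an existing DStream[String], e.g. created with
// ssc.socketTextStream(...); every batch interval produces exactly one RDD
lines.foreachRDD { (rdd: RDD[String], time: Time) =>
  // `rdd` contains the records received during this batch interval
  println(s"Batch at $time contains ${rdd.count()} records")
}
```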
4 A Simple Spark Streaming Example
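A minimal sketch of such a program, the standard socket word count (the host `localhost` and port 9999 are placeholders for a text source such as `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SimpleStreamingExample {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, at least one for processing
    val conf = new SparkConf().setAppName("SimpleStreamingExample").setMaster("local[2]")
    // 1-second batch interval: the input stream is cut into 1-second batches
    val ssc = new StreamingContext(conf, Seconds(1))

    // input DStream from a TCP source
    val lines = ssc.socketTextStream("localhost", 9999)

    // classic word count on each batch
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()             // start the computation
    ssc.awaitTermination()  // wait for it to terminate
  }
}
```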
A single SparkContext can be reused to create multiple StreamingContext objects, provided that the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
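A minimal sketch of that rule (assuming an existing SparkContext `sc`):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// first StreamingContext built on top of the existing SparkContext `sc`
val ssc1 = new StreamingContext(sc, Seconds(1))
// ... define DStreams, call ssc1.start(), and eventually stop it,
// keeping the underlying SparkContext alive:
ssc1.stop(stopSparkContext = false)

// only after ssc1 has been stopped may a new StreamingContext reuse `sc`
val ssc2 = new StreamingContext(sc, Seconds(5))
```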
5 Input DStreams
Input DStreams are DStreams that represent the raw data streams received from data sources. Spark Streaming provides two categories of sources:
- Basic sources: sources directly available in the StreamingContext API, such as file systems and socket connections.
- Advanced sources: sources such as Kafka, Flume, Kinesis, and Twitter.
- Note that if you want to receive multiple data streams in parallel in a streaming application, you can create multiple input DStreams (covered in the section on performance tuning). This creates multiple receivers that receive multiple data streams simultaneously. Be aware, however, that a receiver runs as a long-running task on a Spark worker/executor, so it occupies one of the cores allocated to the Spark Streaming application. It is therefore important to allocate enough cores (or threads, when running locally) to the application both to process the received data and to run the receiver(s), as sketched below.
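As a sketch of the point about receivers and cores, the snippet below creates two socket receivers and unions their streams; the host names and ports are placeholders. `local[3]` is used so that the two receivers (one core each) still leave at least one core for processing:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// two receivers => at least 3 threads: 2 for the receivers, 1+ for processing
val conf = new SparkConf().setAppName("MultiReceiverSketch").setMaster("local[3]")
val ssc = new StreamingContext(conf, Seconds(2))

// each socketTextStream creates its own receiver, a long-running task
val stream1 = ssc.socketTextStream("host1", 9999)
val stream2 = ssc.socketTextStream("host2", 9998)

// combine the two input DStreams into one for downstream processing
val combined = stream1.union(stream2)
combined.count().print()

ssc.start()
ssc.awaitTermination()
```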
6 Transformations on DStreams
Similar to RDDs, transformations allow the data coming from input DStreams to be modified. DStreams support many of the transformation operators available on RDDs, listed in the table below; a short combined example follows the table.
Transformation | Meaning |
---|---|
map(func) | Return a new DStream by passing each element of the source DStream through the function func |
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items |
filter(func) | Return a new DStream containing only the elements of the source DStream for which func returns true |
repartition(numPartitions) | Change the level of parallelism of this DStream by creating more or fewer partitions |
union(otherStream) | Return a new DStream containing the union of the elements of the source DStream and otherStream |
count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream |
reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements of each RDD of the source DStream using the function func. The function should be associative so that the computation can be parallelized |
countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream |
reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values of each key are aggregated using the given reduce function. Note: by default this uses Spark's default number of parallel tasks for the grouping; use the numTasks argument to set a different number of tasks |
join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs |
cogroup(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples |
transform(func) | Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to apply arbitrary RDD operations on the DStream |
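A short sketch that chains several of the transformations above (it assumes the `lines` DStream from the word-count example in section 4):

```scala
// split each line into words, drop empty strings, and count occurrences per batch
val words      = lines.flatMap(_.split("\\s+"))   // flatMap: 0..n outputs per input
val nonEmpty   = words.filter(_.nonEmpty)         // filter: keep matching elements only
val wordCounts = nonEmpty.map(w => (w, 1)).reduceByKey(_ + _)

// transform: apply an arbitrary RDD-to-RDD function to every RDD of the DStream
val sorted = wordCounts.transform(rdd => rdd.sortBy(_._2, ascending = false))
sorted.print()
```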
7 Window Operations
Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. The figure below illustrates this sliding window.
As the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In the figure, the operation is applied over the last 3 time units of data and slides by 2 time units. This means that any window operation needs to specify two parameters:
- window length: the duration of the window (3 time units in the figure)
- sliding interval: the interval at which the window operation is performed (2 time units in the figure)
Both of these parameters must be multiples of the batch interval of the source DStream.
Let us illustrate the window operations with an example. Suppose you want to extend the earlier word-count example to count the words in the last 30 seconds of data, with the source DStream producing a batch every 10 seconds. To do this, we apply the reduceByKey operation on the DStream of (word, 1) pairs over the last 30 seconds of data, which is done with the reduceByKeyAndWindow operation, as sketched below. Some common window operations follow; all of them take the two parameters described above, window length (windowDuration) and sliding interval (slideDuration).
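For the 30-second window / 10-second slide example above, the core call looks like this (assuming `pairs` is a DStream of (word, 1) pairs and the batch interval divides both durations):

```scala
// reduce over the last 30 seconds of data, recomputed every 10 seconds
val windowedWordCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()
```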
------------------------- Experiment data -------------------------
(One element is drawn from it at random every second and used as the socket input.) The socket-side data simulator, the test functions, and the other programs can be found via the Baidu Cloud link in the appendix.
// Input: window length (the slide interval implicitly defaults to the slide duration of this
// DStream, i.e. the interval at which its RDDs are generated)
// Output: a new DStream containing all the elements that fall within the sliding window
def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)

// Input: window length and slide interval
// Output: a new DStream containing all the elements that fall within the sliding window
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
  new WindowedDStream(this, windowDuration, slideDuration)
}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object windowOnStreaming {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- window
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the Window operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // set the checkpoint directory
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)
    val data = socketStreaming.map(x => (x, 1))

    // def window(windowDuration: Duration): DStream[T]
    val getedData1 = data.window(Seconds(6))
    println("windowDuration only : ")
    getedData1.print()

    // same as
    // def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
    // (both durations must be multiples of the 2-second batch interval)
    // val getedData2 = data.window(Seconds(8), Seconds(4))
    // println("Duration and SlideDuration : ")
    // getedData2.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

-------------------- reduceByKeyAndWindow operation --------------------------------
/** Applies reduceByKey over each sliding window and returns a new DStream. It is similar to
 *  `DStream.reduceByKey()`, except that the function is applied over a sliding window.
 *  Hash partitioning with Spark's default partitioner is used.
 *  @param reduceFunc      the reduce function (applied pairwise, left to right)
 *  @param windowDuration  width of the window
 *  The slide duration defaults to one batch interval, and the number of partitions is the
 *  RDD default (it depends on the cores of the Spark cluster).
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, self.slideDuration, defaultPartitioner())
}

/** Same as above, but with an explicit slide duration.
 *  Hash partitioning with Spark's default partitioner is used.
 *  @param reduceFunc      the reduce function (applied pairwise, left to right)
 *  @param windowDuration  width of the window
 *  @param slideDuration   sliding interval of the window
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
}

/** Same as above, but with an explicit number of partitions.
 *  @param reduceFunc      the reduce function (applied pairwise, left to right)
 *  @param windowDuration  width of the window
 *  @param slideDuration   sliding interval of the window
 *  @param numPartitions   number of partitions of each RDD in the new DStream
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration,
    defaultPartitioner(numPartitions))
}

/** Same as above, but with an explicit partitioner.
 *  @param reduceFunc      the reduce function (applied pairwise, left to right)
 *  @param windowDuration  width of the window
 *  @param slideDuration   sliding interval of the window
 *  @param partitioner     partitioner for controlling the partitioning of each RDD
 *                         in the new DStream
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner
  ): DStream[(K, V)] = ssc.withScope {
  self.reduceByKey(reduceFunc, partitioner)
      .window(windowDuration, slideDuration)
      .reduceByKey(reduceFunc, partitioner)
}

/** Incremental variant: applies reduceByKey over each sliding window, and additionally applies
 *  invReduceFunc to the old RDDs that leave the window.
 *  Hash partitioning with Spark's default partitioner is used.
 *  @param reduceFunc      the reduce function (applied pairwise, left to right)
 *  @param invReduceFunc   inverse reduce function; such that for all y, invertible x:
 *                         `invReduceFunc(reduceFunc(x, y), x) = y`
 *  @param windowDuration  width of the window
 *  @param slideDuration   sliding interval of the window
 *  @param numPartitions   number of partitions of each RDD in the new DStream
 *  @param filterFunc      function used to filter which key-value pairs are retained
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration = self.slideDuration,
    numPartitions: Int = ssc.sc.defaultParallelism,
    filterFunc: ((K, V)) => Boolean = null
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(
    reduceFunc, invReduceFunc, windowDuration,
    slideDuration, defaultPartitioner(numPartitions), filterFunc
  )
}

/** Incremental variant with an explicit partitioner.
 *  @param reduceFunc      the reduce function (applied pairwise, left to right)
 *  @param invReduceFunc   inverse reduce function; such that for all y, invertible x:
 *                         `invReduceFunc(reduceFunc(x, y), x) = y`
 *  @param windowDuration  width of the window
 *  @param slideDuration   sliding interval of the window
 *  @param partitioner     partitioner for controlling the partitioning of each RDD
 *                         in the new DStream
 *  @param filterFunc      function used to filter which key-value pairs are retained
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner,
    filterFunc: ((K, V)) => Boolean
  ): DStream[(K, V)] = ssc.withScope {
  val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
  val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
  val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
  new ReducedWindowedDStream[K, V](
    self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
    windowDuration, slideDuration, partitioner
  )
}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object reduceByWindowOnStreaming {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- reduceByKeyAndWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByKeyAndWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // set the checkpoint directory (required by the incremental, invReduceFunc-based overloads)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)
    val data = socketStreaming.map(x => (x, 1))

    // def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
    // val getedData1 = data.reduceByKeyAndWindow(_ + _, Seconds(6))

    // incremental form: the inverse of addition is subtraction
    val getedData2 = data.reduceByKeyAndWindow(_ + _, (a, b) => a - b, Seconds(6), Seconds(2))

    // window and slide durations must be multiples of the 2-second batch interval
    val getedData1 = data.reduceByKeyAndWindow(_ + _, _ - _, Seconds(10), Seconds(6))

    println("reduceByKeyAndWindow : ")
    getedData1.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The incremental (invReduceFunc) version is implemented internally by the ReducedWindowedDStream class, as shown in the source above.
------------------ reduceByWindow operation ---------------------------
// Input: reduceFunc, window length, slide interval
// The reduce function takes two elements at a time, from left to right, and combines
// them into a single value for the whole window
def reduceByWindow(
    reduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
}

/**
 * Input: reduceFunc, invReduceFunc, window length, slide interval
 */
def reduceByWindow(
    reduceFunc: (T, T) => T,
    invReduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.map((1, _))
      .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
      .map(_._2)
}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object reduceByWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- reduceByWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // set the checkpoint directory
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    // concatenate all lines received in the last 6 seconds, recomputed every 2 seconds
    val data = socketStreaming.reduceByWindow(_ + _, Seconds(6), Seconds(2))
    // the incremental overload (reduceFunc, invReduceFunc, ...) needs a true inverse
    // function; string concatenation has none, so it is not used here
    // val data = socketStreaming.reduceByWindow(_ + _, _ + _, Seconds(6), Seconds(2))

    println("reduceByWindow: concatenate the lines in the window")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

----------------------------------------------- countByWindow operation ---------------------------------
/**
 * Input: window length and slide interval; returns the number of elements in each window
 * @param windowDuration width of the window
 * @param slideDuration  sliding interval of the window
 */
def countByWindow(
    windowDuration: Duration,
    slideDuration: Duration): DStream[Long] = ssc.withScope {
  // map every element to 1, then apply an incremental reduceByWindow
  // (add entering elements, subtract leaving ones)
  this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object countByWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- countByWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the countByWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // set the checkpoint directory
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)
    val data = socketStreaming.countByWindow(Seconds(6), Seconds(2))

    println("countByWindow: count the number of elements")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
-------------------------------- countByValueAndWindow operation --------------------------------
/**
 * Input: window length, slide interval, and number of RDD partitions
 * (by default the number of partitions equals the default parallelism)
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 * @param numPartitions  number of partitions of each RDD in the new DStream.
 */
def countByValueAndWindow(
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int = ssc.sc.defaultParallelism)
    (implicit ord: Ordering[T] = null)
  : DStream[(T, Long)] = ssc.withScope {
  this.map((_, 1L)).reduceByKeyAndWindow(
    (x: Long, y: Long) => x + y,
    (x: Long, y: Long) => x - y,
    windowDuration,
    slideDuration,
    numPartitions,
    (x: (T, Long)) => x._2 != 0L)
}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object countByValueAndWindow {
  def main(args: Array[String]) {
    /**
     * this is a test of the Streaming operation ----- countByValueAndWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the countByValueAndWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))

    // set the checkpoint directory
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)
    val data = socketStreaming.countByValueAndWindow(Seconds(6), Seconds(2))

    println("countByValueAndWindow: count the occurrences of each distinct value")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
