Spark Streaming introduction:
Stream processing built on top of Spark (RDD-based).
Streaming: source ==> compute ==> store
Batch (offline) processing is a special case of streaming.
Letting you write streaming jobs the same way you write batch jobs.
Out of the box (OOTB): built in, ready to use with no extra setup.
To combine a DStream with an RDD, use the transform operator.
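For example, a minimal sketch in spark-shell (the lookupRDD contents and the port are made up for illustration):
import org.apache.spark.streaming.{Seconds, StreamingContext}
// assumes a spark-shell session, so sc already exists
val ssc = new StreamingContext(sc, Seconds(5))
// an ordinary RDD used as reference data
val lookupRDD = sc.parallelize(Seq(("spark", "framework"), ("kafka", "queue")))
// transform exposes the RDD behind each micro-batch, so any RDD-to-RDD API can be used
val joined = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .transform(rdd => rdd.join(lookupRDD))
joined.print()
ssc.start()
ssc.awaitTermination()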
Streaming, Core, and SQL compared:
Core abstraction:
Streaming: DStream <= represents a continuous stream of data
Core: RDD
SQL: DF/DS
Entry point (see the sketch below):
Streaming: StreamingContext
Core: SparkContext
SQL: SparkSession (newer) / SQLContext, HiveContext (older)
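A minimal sketch of constructing each entry point in a standalone application (names are chosen for illustration):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setAppName("EntryPoints").setMaster("local[2]")
// Core: SparkContext
val sc = new SparkContext(conf)
// Streaming: StreamingContext = SparkContext + batch interval
val ssc = new StreamingContext(sc, Seconds(5))
// SQL: SparkSession (wraps a SparkContext; reuses the existing one here)
val spark = SparkSession.builder().config(conf).getOrCreate()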
Programming model 1: socketTextStream:
import org.apache.spark._
import org.apache.spark.streaming._
// spark-shell: reuse the shell's SparkContext (sc); batch interval = 5 seconds
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
This connection mode uses a receiver.
The following warning shows that the socket receiver occupies one core:
18/09/07 22:42:41 WARN StreamingContext:
spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
An operation on a DStream is effectively applied to every RDD underlying that DStream.
Direct mode:
Programming model 2: textFileStream (no Receiver):
import org.apache.spark._
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(10))
// only new files moved/renamed into this directory after the job starts are processed
val lines = ssc.textFileStream("/streaming/input/")
val words = lines.flatMap(_.split("\t"))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
Programming model 3: Checkpoint & updateStateByKey:
// currentValues: all new values for this key in the current batch
// preValues: the previously accumulated state for this key (None the first time the key appears)
def updateFunction(currentValues: Seq[Int], preValues: Option[Int]): Option[Int] = {
  val curr = currentValues.sum
  val pre = preValues.getOrElse(0)
  Some(curr + pre)
}
import org.apache.spark._
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("/streaming/checkpoint/") // required by updateStateByKey: state is checkpointed here
val lines = ssc.socketTextStream("hadoop000",9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val result = pairs.updateStateByKey(updateFunction)
result.print()
ssc.start()
ssc.awaitTermination()
Package and submit:
./spark-submit \
--master local[2] \
--name StreamingStateApp \
--class com.ruozedata.spark.streaming.day01.StreamingStateApp \
/home/hadoop/lib/g3-spark-1.0.jar
This produces cumulative (stateful) counts across batches.
mapWithState (relatively higher performance): mapWithState implements state management through two main points: a) historical state is maintained in memory, the same as with updateStateByKey; b) a custom mappingFunction that updates the state, defined according to business requirements.
Detailed explanation and source code analysis
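A minimal sketch of mapWithState in spark-shell, reusing the word-count pipeline above (hadoop000:9999 and the checkpoint path follow the earlier examples):
import org.apache.spark._
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("/streaming/checkpoint/") // mapWithState also needs checkpointing
// (word, count in this batch, State holding the running total) => emitted record
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum) // write the new running total back into the state store
  (word, sum)
}
val pairs = ssc.socketTextStream("hadoop000", 9999).flatMap(_.split(" ")).map((_, 1))
val stateDstream = pairs.mapWithState(StateSpec.function(mappingFunc))
stateDstream.print()
ssc.start()
ssc.awaitTermination()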
WAL (Write Ahead Log):
WAL framework and implementation
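A minimal sketch of turning the receiver WAL on in a standalone application; spark.streaming.receiver.writeAheadLog.enable is the standard configuration key, and the checkpoint directory (ideally on HDFS) is where the logs are written:
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
val sparkConf = new SparkConf()
  .setAppName("WALApp")
  .setMaster("local[2]")
  // persist received blocks to the write-ahead log before acknowledging them
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(sparkConf, Seconds(5))
ssc.checkpoint("/streaming/checkpoint/") // the WAL files live under the checkpoint directory
// with the WAL enabled, in-memory replication of received data is no longer needed
val lines = ssc.socketTextStream("hadoop000", 9999, StorageLevel.MEMORY_AND_DISK_SER)
lines.count().print()
ssc.start()
ssc.awaitTermination()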
Data blacklist:
package com.ruozedata.spark.streaming.day02

import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object LeftJoinApp {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("LeftJoinApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // dataset 1: access records
    val input1 = new ListBuffer[(String, Long)]
    input1.append(("www.ruozedata.com", 8888))
    input1.append(("www.ruozedata.com", 9999))
    input1.append(("www.baidu.com", 7777))
    val data1 = sc.parallelize(input1)

    // dataset 2: the blacklist
    val input2 = new ListBuffer[(String, Boolean)]
    input2.append(("www.baidu.com", true))
    val data2 = sc.parallelize(input2)

    // filter out blacklisted domains with a left outer join
    data1.leftOuterJoin(data2)
      .filter(x => {
        x._2._2.getOrElse(false) != true
      })
      .map(x => (x._1, x._2._1))
      .collect().foreach(println)

    sc.stop()
  }
}
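The same blacklist filtering carries over to streaming by combining it with the transform operator mentioned earlier; a minimal sketch in spark-shell (the hadoop000:9999 source and the "domain,port" line format are assumptions for illustration):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
// the blacklist is an ordinary RDD of (domain, true)
val blacklistRDD = sc.parallelize(Seq(("www.baidu.com", true)))
val logs = ssc.socketTextStream("hadoop000", 9999)
val cleaned = logs.map(line => (line.split(",")(0), line)) // key each log line by domain
  .transform { rdd =>
    rdd.leftOuterJoin(blacklistRDD) // (domain, (line, Option[Boolean]))
      .filter(_._2._2.getOrElse(false) != true) // drop blacklisted domains
      .map(_._2._1) // keep only the original line
  }
cleaned.print()
ssc.start()
ssc.awaitTermination()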
Design pattern for using foreachRDD: the official guide explains several common incorrect usages; the code below is the recommended pattern.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections;
    // reusing pooled connections avoids the cost of creating one per record or per partition
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
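ConnectionPool above is pseudocode from the guide, not a real Spark API. A minimal sketch of what a static, lazily initialized, per-executor pool could look like, here backed by plain JDBC (the MySQL URL and credentials are placeholders):
import java.sql.{Connection, DriverManager}
import scala.collection.mutable
// a Scala object is initialized lazily, once per executor JVM,
// so the pool is shared by all tasks running in that executor
object ConnectionPool {
  private val pool = new mutable.Queue[Connection]()
  private def createConnection(): Connection =
    DriverManager.getConnection("jdbc:mysql://hadoop000:3306/test", "user", "password")
  def getConnection(): Connection = synchronized {
    if (pool.isEmpty) createConnection() else pool.dequeue()
  }
  def returnConnection(conn: Connection): Unit = synchronized {
    pool.enqueue(conn)
  }
}
A JDBC Connection has no send method; inside foreachPartition you would create a PreparedStatement from the pooled connection, write the partition's records with it, and then return the connection to the pool.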
Window operations:
Parameters: window length - the duration of the window; slide interval - the interval at which the window operation is performed.
Example:
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
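A fuller sketch in context (spark-shell; batch interval 5 seconds, window length 30 seconds, slide interval 10 seconds; both window parameters must be multiples of the batch interval):
import org.apache.spark._
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(5)) // batch interval: 5 seconds
val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))
// count words over the last 30 seconds of data, recomputed every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()
ssc.start()
ssc.awaitTermination()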