Spark Streaming: Introduction and Basic Operations

This article introduces the basic concepts of Spark Streaming, including DStreams, the stream-processing model, and the core components. It demonstrates how to create and operate on DStreams through two programming models, socketTextStream and textFileStream. It also covers the receiver-less Direct mode, stateful processing with checkpointing and updateStateByKey, and key features such as the WAL (write-ahead log) and data blacklisting. Finally, it discusses the parameters and usage of window operations.

Spark Streaming overview:
    Stream processing built on top of Spark (RDD-based)
    Stream: source ==> compute ==> store
    Offline (batch) processing is a special case of streaming
 
letting you write streaming jobs the same way you write batch jobs 
 
out of the box (OOTB): built in, usable with no extra setup

To combine a DStream with an RDD, use the transform operator, as in the sketch below.
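A minimal sketch of transform, joining each micro-batch against a static blacklist RDD (the names ssc, sc, and blacklistRDD are illustrative assumptions, not from the original post):

val blacklistRDD = sc.parallelize(Seq(("www.baidu.com", true)))  // static RDD

val domains = ssc.socketTextStream("localhost", 9999).map(domain => (domain, 1))

// transform exposes the RDD behind each micro-batch, so ordinary RDD
// operators such as leftOuterJoin can be mixed into a DStream pipeline
val cleaned = domains.transform { rdd =>
  rdd.leftOuterJoin(blacklistRDD)
     .filter { case (_, (_, flagged)) => !flagged.getOrElse(false) }
     .map { case (domain, (count, _)) => (domain, count) }
}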

Comparing the Streaming, Core, and SQL abstractions:
Streaming: DStream <= represents a continuous stream of data
Core: RDD
SQL: DF/DS

Entry points:
Streaming: StreamingContext
Core:SparkContext
SQL: 
    SparkSession
    SQLContext/HiveContext
    
Programming model 1: socketTextStream
import org.apache.spark._
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(5))        // 5-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)   // receiver-based socket source
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()             // start the computation
ssc.awaitTermination()  // wait for the computation to terminate
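To test it, start a local data server with netcat (nc -lk 9999, as in the official quick example) and type words into that terminal; each 5-second batch prints its word counts.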

This connection mode uses a receiver.

As the following warning indicates, the socket receiver occupies one core:
18/09/07 22:42:41 WARN StreamingContext: 
spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.

An operation applied to a DStream is, in fact, applied identically to every RDD that underlies that DStream.

Direct mode:
Programming model 2: textFileStream (no receiver)

import org.apache.spark._
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(10))       // 10-second batch interval
val lines = ssc.textFileStream("/streaming/input/")   // monitors a directory for new files; no receiver
val words = lines.flatMap(_.split("\t"))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
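Note that, per the official documentation, textFileStream only picks up files that appear in the monitored directory after the stream starts, and files should be created in, or atomically moved/renamed into, that directory; appending to a file already in the directory will not be picked up.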

 

Programming model 3: checkpointing & updateStateByKey
def updateFunction(currentValues: Seq[Int], preValues: Option[Int]): Option[Int] = {
  val curr = currentValues.sum       // sum of the current batch's values for this key
  val pre = preValues.getOrElse(0)   // previously accumulated count, 0 if none
  Some(curr + pre)
}

import org.apache.spark._
import org.apache.spark.streaming._    
val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("/streaming/checkpoint/")  // checkpointing is required for stateful operators
val lines = ssc.socketTextStream("hadoop000",9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val result = pairs.updateStateByKey(updateFunction)
result.print()
ssc.start()
ssc.awaitTermination()

Package the application and submit it:
./spark-submit \
--master local[2] \
--name StreamingStateApp \
--class com.ruozedata.spark.streaming.day01.StreamingStateApp \
/home/hadoop/lib/g3-spark-1.0.jar

The word counts are now accumulated across batches.

mapWithState (relatively higher performance): mapWithState implements stateful processing through two mechanisms: (a) historical state is maintained in memory, just as with updateStateByKey; (b) a user-defined mappingFunction updates the state, with logic that depends on the business requirements.
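A minimal sketch of mapWithState, following the pattern of the official StatefulNetworkWordCount example; it assumes pairs is a DStream[(String, Int)] as in the examples above, and checkpointing must be enabled:

import org.apache.spark.streaming.{State, StateSpec}

// For each key, merge the batch's value into the stored state and emit (word, total).
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)            // write the new total back into the state store
  (word, sum)
}

val stateCounts = pairs.mapWithState(StateSpec.function(mappingFunc))
stateCounts.print()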

WAL (Write-Ahead Log):
With the WAL enabled, data received by a receiver is first written to a log on fault-tolerant storage (e.g. HDFS) so that it can be replayed after a driver failure instead of being lost.
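A minimal sketch of enabling the receiver WAL (the app name and paths are illustrative; checkpointing must also be enabled, since the log is written under the checkpoint directory):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WALApp")
  .setMaster("local[2]")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // turn the WAL on

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("/streaming/checkpoint/")  // WAL files are stored under the checkpoint directory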

Data blacklisting:

package com.ruozedata.spark.streaming.day02
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object LeftJoinApp {

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("LeftJoinApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // dataset one: (domain, value) records
    val input1 = new ListBuffer[(String,Long)]
    input1.append(("www.ruozedata.com", 8888))
    input1.append(("www.ruozedata.com", 9999))
    input1.append(("www.baidu.com", 7777))
    val data1 = sc.parallelize(input1)

    // dataset two: the blacklist
    val input2 = new ListBuffer[(String,Boolean)]
    input2.append(("www.baidu.com",true))
    val data2 = sc.parallelize(input2)
    // filter the data against the blacklist via a left outer join
    data1.leftOuterJoin(data2)
      .filter { case (_, (_, flagged)) => !flagged.getOrElse(false) }
      .map { case (domain, (value, _)) => (domain, value) }
      .collect().foreach(println)
    sc.stop()
  }
}
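Running this prints the records that survive the blacklist filter (output order may vary):

(www.ruozedata.com,8888)
(www.ruozedata.com,9999)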

Design pattern for foreachRDD: the official guide walks through several common mistakes (such as creating a connection object on the driver, which cannot be serialized and shipped to workers, or opening a new connection per record); the code below is the recommended usage.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections.
    // Reusing pooled connections is the best choice here: it avoids the
    // overhead of creating a new connection object for every partition.
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}

Window operations:
Parameters:
Window length: the duration of the window. Sliding interval: the interval at which the window operation is performed. Both must be multiples of the source DStream's batch interval.

Example:

val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
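This computes word counts over the last 30 seconds of data, sliding every 10 seconds, matching the windowed example in the official programming guide.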

 
