SparkStreaming02


License: CC 4.0 BY-SA

Tags: #BigData

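This article covers Spark Streaming. It introduces several operators, such as mapWithState (which performs better than updateStateByKey) and transform (used to run RDD operations on a DStream); explains the output (action) operations, focusing on foreachRDD and its design patterns; and touches on other topics, including DataFrame and SQL operations, window computations, join operations, broadcast variables, and application monitoring.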


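I. Some Spark Streaming Operators

1. mapWithState vs. updateStateByKey

mapWithState performs better than updateStateByKey: it only processes the keys that receive new data in the current batch, whereas updateStateByKey goes over every stored key in every batch.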
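A minimal sketch of a running word count with mapWithState, assuming the same socket source (hadoop002:8888) that the later examples in this post use. The object name, the checkpoint path, and the helper addToTotal are hypothetical and only serve the illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateSketch {
  // Pure helper: new running total = value in this record + previously stored total
  def addToTotal(current: Option[Int], previous: Option[Int]): Int =
    current.getOrElse(0) + previous.getOrElse(0)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MapWithStateSketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Stateful operators need a checkpoint directory (this path is an assumption)
    ssc.checkpoint("/tmp/mapwithstate-checkpoint")

    val pairs = ssc.socketTextStream("hadoop002", 8888)
      .flatMap(_.split(","))
      .map(word => (word, 1))

    // Invoked once per incoming record; keys without new data are not touched
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val total = addToTotal(one, state.getOption())
      state.update(total)
      (word, total)
    }

    pairs.mapWithState(StateSpec.function(mappingFunc)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Keeping the arithmetic in a small pure function makes the state update easy to check without starting a streaming job.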

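2. transform

Use this operator when a DStream needs to be processed with RDD operations: it applies an RDD-to-RDD function to each batch of the stream.

(1) Scenario

Data 1: log records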

domain,time,traffic

ruozedata.com

baidu.com

Data 2: an existing file (the blacklist)

domain

baidu

====> Filter out the log records whose domain is on the blacklist.

(2) Spark Core implementation

Code:

package com.HBinz.spark.streaming.day02

import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object leftJoinApp {
  def main(args: Array[String]): Unit = {

    val sparkconf = new SparkConf().setAppName("leftJoinApp").setMaster("local[2]")
    val sc = new SparkContext(sparkconf)
    // Data 1
    val input1 = new ListBuffer[(String,Long)]
    input1.append(("www.ruozedata.com",6666))
    input1.append(("www.ruozedata.com",7777))
    input1.append(("www.baidu.com",8888))
    // Convert the mutable list input1 to an RDD
    val data1 = sc.parallelize(input1)
    // Data 2 (the blacklist)
    val input2 = new ListBuffer[(String,Boolean)]
    input2.append(("www.baidu.com",true))
    // Convert the mutable list input2 to an RDD
    val data2 = sc.parallelize(input2)
    // Left join: keep every record from the left side; records whose key also exists on the right side carry Some(true)
    // leftOuterJoin requires key-value RDDs, so input2 pairs each domain with a Boolean flag
    data1.leftOuterJoin(data2)
      // Drop blacklisted records such as (www.baidu.com,(8888,Some(true)))
      .filter(x => !x._2._2.getOrElse(false))
      // Keep (www.ruozedata.com,xxxx)
      .map(x => (x._1, x._2._1))
      .collect().foreach(println)
    sc.stop()
  }
}

(3) Spark Streaming implementation

Code:

package com.HBinz.spark.streaming.day02

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable.ListBuffer

object leftJoinStreamingApp {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setMaster("local[2]").setAppName("leftJoinStreamingApp")
    val scc = new StreamingContext(sparkconf,Seconds(10))
    // Data 2 (the blacklist), as an RDD
    val input2 = new ListBuffer[(String,Boolean)]
    input2.append(("www.baidu.com",true))
    val data2 = scc.sparkContext.parallelize(input2)
    // Data 1: sent in through nc -lk
    val lines = scc.socketTextStream("hadoop002",8888)
    // Key each line by its first comma-separated field: (www.baidu.com, "www.baidu.com,xxxx")
    val data1 = lines.map(x => (x.split(",")(0), x))
    data1
      // transform exposes the RDD behind each batch of the DStream,
      // so every batch of data 1 can be left-joined with the data 2 RDD
      .transform(rdd => rdd.leftOuterJoin(data2))
      // Drop blacklisted records such as (www.baidu.com,(www.baidu.com,9999,Some(true)))
      .filter(x => !x._2._2.getOrElse(false))
      // Keep only the original log line
      .map(x => x._2._1)
      .print()
    scc.start()
    scc.awaitTermination()
  }
}

Input:

www.ruozedata.com,6666

www.ruozedata.com,7777

www.ruozedata.com,8888

www.baidu.com,9999

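Output:

Scenario:

www.yy.com

During testing, logs from test.yy.com are also sent in and need to be filtered out. In production this is handled with a broadcast variable.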
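The broadcast-variable approach could look roughly like the sketch below. It assumes each log line starts with the domain followed by a comma, as in the input above; the object name, the helper keep, and the blacklist contents are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BroadcastFilterSketch {
  // Pure helper: keep a line unless its domain (the text before the first comma) is blacklisted
  def keep(line: String, blacklist: Set[String]): Boolean =
    !blacklist.contains(line.takeWhile(_ != ','))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BroadcastFilterSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Small lookup set shipped once to each executor (contents are an assumption)
    val blacklist = ssc.sparkContext.broadcast(Set("test.yy.com"))

    ssc.socketTextStream("hadoop002", 8888)
      .filter(line => keep(line, blacklist.value))
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Compared with the leftOuterJoin version, a broadcast lookup avoids shuffling the stream, which is a reasonable trade-off as long as the blacklist is small enough to fit in executor memory.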

II. Actions (Output Operations on DStreams)

1. Key point: foreachRDD()

Scenario:

Mainly used to write output to external systems.

2. Create a wc table; the RDDs will be written to it later

create database g3;

create table wc(word varchar(20),c int(10));

3. Design patterns for foreachRDD

foreachRDD is a powerful primitive that allows data to be sent to external systems. However, it is important to understand how to use it correctly and efficiently. Some common mistakes to avoid are as follows.

Often, writing data to an external system requires creating a connection object (e.g. a TCP connection to a remote server) and using it to send data to the remote system. For this purpose, a developer may inadvertently try to create the connection object at the Spark driver and then use it in a Spark worker to save the records in the RDDs. For example (in Scala):

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}

Code following the pattern shown above from the official documentation:

package com.HBinz.spark.streaming.day02

import java.sql.{Driver, DriverManager}

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object foreachRDDApp {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName("foreachRDDApp").setMaster("local[2]")
    val scc = new StreamingContext(sparkconf,Seconds(10))
    val lines = scc.socketTextStream("hadoop002",8888)
    val result = lines.flatMap(_.split(",")).map(x=>(x,1)).reduceByKey(_ + _)
    //TODO...result ===>  MySQL
    result.foreachRDD { rdd =>
      val connection = createNewConnection()  // one connection per RDD, created on the driver
      rdd.foreach { record =>
        // Each record is (word, count)
        val word = record._1
        val count = record._2
        // Insert the word counts received via nc -lk into the MySQL wc table
        val sql = s"insert into wc(word,c) values ($word,$count)"
        // Execute the statement
        connection.createStatement().execute(sql)
      }
    }
    scc.start()
    scc.awaitTermination()
  }
  // Implementation of createNewConnection
  def createNewConnection()={
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://hadoop002:3306/g3","root","123456")
  }
}

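However, this is incorrect, because it requires the connection object to be serialized and sent from the driver to the worker. Such connection objects are rarely transferable across machines. This error may manifest as serialization errors (connection object not serializable), initialization errors (connection object needs to be initialized at the workers), etc. The correct solution is to create the connection object at the worker.

Another approach: create the connection on the executor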

//TODO...result ===>  MySQL
result.foreachRDD { rdd =>
  rdd.foreach { record =>
    // Each record is (word, count)
    val connection = createNewConnection()  // one connection per record, created on the executor
    val word = record._1
    val count = record._2
    // Insert the word counts received via nc -lk into the MySQL wc table; word is a String/varchar, so it must be wrapped in ''
    val sql = s"insert into wc(word,c) values ('$word',$count)"
    // Execute the statement
    connection.createStatement().execute(sql)
  }
}

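However, there is still a big problem: if the RDD holds a large number of records, a new connection is requested for every single record, which wastes resources.

Finally: use foreachPartition

A better solution is to use rdd.foreachPartition - create a single connection object and send all the records in a RDD partition using that connection.

Code: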

package com.HBinz.spark.streaming.day02

import java.sql.{Driver, DriverManager}

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object foreachRDDApp {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName("foreachRDDApp").setMaster("local[2]")
    val scc = new StreamingContext(sparkconf,Seconds(10))
    val lines = scc.socketTextStream("hadoop002",8888)
    val result = lines.flatMap(_.split(",")).map(x=>(x,1)).reduceByKey(_ + _)
    //TODO...result ===>  MySQL
    result.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // One connection per partition, created on the executor
        val connection = createNewConnection()
        partitionOfRecords.foreach(record => {
          // Each record is (word, count)
          val word = record._1
          val count = record._2
          // Insert the word counts received via nc -lk into the MySQL wc table; word is a String/varchar, so it must be wrapped in ''
          val sql = s"insert into wc(word,c) values ('$word',$count)"
          // Execute the statement
          connection.createStatement().execute(sql)
        })
        connection.close()
      }
    }
    scc.start()
    scc.awaitTermination()
  }
  // Implementation of createNewConnection
  def createNewConnection()={
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://hadoop002:3306/g3","root","123456")
  }
}

Test:

(1) Clear the MySQL data

truncate table wc;

(2) Submit the job

Best: a connection pool

(3) Further reading: BoneCP (http://www.jolbox.com/)

Add the dependency to the pom file

<dependency>
  <groupId>com.jolbox</groupId>
  <artifactId>bonecp</artifactId>
  <version>0.8.0.RELEASE</version>
</dependency>

1) BoneCPConfig

2) getConnection
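The two steps above (build a BoneCPConfig, then call getConnection) might be combined as in the following sketch. It reuses the JDBC URL and credentials from the earlier examples; the object name, the save method, and the pool sizing values are assumptions, and the prepared statement is a swapped-in alternative to the string-built SQL used earlier.

```scala
import java.sql.Connection

import com.jolbox.bonecp.{BoneCP, BoneCPConfig}
import org.apache.spark.streaming.dstream.DStream

object ConnectionPoolSketch {
  // lazy val: the pool is built once per JVM (that is, once per executor), on first use
  private lazy val pool: BoneCP = {
    Class.forName("com.mysql.jdbc.Driver")
    val config = new BoneCPConfig()
    config.setJdbcUrl("jdbc:mysql://hadoop002:3306/g3")
    config.setUsername("root")
    config.setPassword("123456")
    // Pool sizing values below are assumptions
    config.setPartitionCount(1)
    config.setMinConnectionsPerPartition(2)
    config.setMaxConnectionsPerPartition(5)
    new BoneCP(config)
  }

  // Borrow a connection; calling close() on it hands it back to the pool
  def getConnection(): Connection = pool.getConnection()

  // Write (word, count) pairs to the wc table, one pooled connection per partition
  def save(result: DStream[(String, Int)]): Unit =
    result.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val connection = getConnection()
        try {
          val stmt = connection.prepareStatement("insert into wc(word,c) values (?,?)")
          records.foreach { case (word, count) =>
            stmt.setString(1, word)
            stmt.setInt(2, count)
            stmt.executeUpdate()
          }
          stmt.close()
        } finally {
          connection.close()
        }
      }
    }
}
```

Because the pool lives in a Scala object, it is initialized on each executor rather than serialized from the driver, so consecutive batches can reuse the same connections.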

Summary:

1. foreachRDD

2. foreachPartition

3. Perform the operations inside each partition
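To make the cost difference between the patterns concrete, here is a small plain-Scala sketch (no Spark required) that counts how many connections each one opens. FakeConnection and the sample partitions are hypothetical stand-ins for a real database connection and a real RDD.

```scala
import java.util.concurrent.atomic.AtomicInteger

object ConnectionCountSketch {
  type Partition = Seq[(String, Int)]

  // Stand-in for a database connection; increments the counter when opened
  final class FakeConnection(opened: AtomicInteger) {
    opened.incrementAndGet()
    def send(record: (String, Int)): Unit = ()
    def close(): Unit = ()
  }

  // One connection per record (the rdd.foreach pattern)
  def perRecord(partitions: Seq[Partition]): Int = {
    val opened = new AtomicInteger(0)
    partitions.foreach { partition =>
      partition.foreach { record =>
        val connection = new FakeConnection(opened)
        connection.send(record)
        connection.close()
      }
    }
    opened.get()
  }

  // One connection per partition (the rdd.foreachPartition pattern)
  def perPartition(partitions: Seq[Partition]): Int = {
    val opened = new AtomicInteger(0)
    partitions.foreach { partition =>
      val connection = new FakeConnection(opened)
      partition.foreach(connection.send)
      connection.close()
    }
    opened.get()
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq("a" -> 1, "b" -> 2, "c" -> 3), Seq("d" -> 4, "e" -> 5))
    println(s"per record: ${perRecord(partitions)} connections")       // 5, one per record
    println(s"per partition: ${perPartition(partitions)} connections") // 2, one per partition
  }
}
```

With five records spread over two partitions, the per-record pattern opens five connections and the per-partition pattern opens two; the gap grows with the number of records per partition.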

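III. Other

1. DataFrame and SQL operations

It is easy to use DataFrame and SQL operations on streaming data. You create a SparkSession from the SparkContext that the StreamingContext is using, and you do so in a way that allows it to be recreated after a driver failure (a lazily instantiated singleton).

In short:

Convert the DStream to RDDs, convert each RDD to a DataFrame, and then process it with Spark SQL.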

/** DataFrame operations inside your streaming program */

val words: DStream[String] = ...

words.foreachRDD { rdd =>

  // Get the singleton instance of SparkSession
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // Convert RDD[String] to DataFrame
  val wordsDataFrame = rdd.toDF("word")

  // Create a temporary view
  wordsDataFrame.createOrReplaceTempView("words")

  // Do word count on DataFrame using SQL and print it
  val wordCountsDataFrame = 
    spark.sql("select word, count(*) as total from words group by word")
  wordCountsDataFrame.show()
}

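2. Caching / Persistence

Explicit caching is usually unnecessary: DStreams generated by window-based and state-based operations are already persisted in memory by default.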

3. Window Operations

Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. The following figure illustrates this sliding window.

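(1) Key points:

Window length: the duration of the window (3 in the figure).

Sliding interval: the interval at which the window operation is performed (2 in the figure).

Both parameters must be multiples of the batch interval of the source DStream (1 in the figure).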
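The relationship between the three parameters can be sketched in plain Scala (no Spark required). The helper names isValid and batchesInWindow are hypothetical; batches are numbered from 1, and each batch covers one batch interval.

```scala
object WindowParamsSketch {
  // Window length and sliding interval must both be whole multiples of the batch interval
  def isValid(batchSec: Int, windowSec: Int, slideSec: Int): Boolean =
    windowSec % batchSec == 0 && slideSec % batchSec == 0

  // Batch numbers covered by the window that ends at time endSec
  def batchesInWindow(batchSec: Int, windowSec: Int, endSec: Int): List[Int] = {
    val last = endSec / batchSec
    val first = math.max(1, last - windowSec / batchSec + 1)
    (first to last).toList
  }

  def main(args: Array[String]): Unit = {
    // Values from the figure: batch interval 1, window length 3, sliding interval 2
    println(isValid(1, 3, 2))
    println(batchesInWindow(1, 3, 3)) // window ending at time 3 covers batches 1, 2, 3
    println(batchesInWindow(1, 3, 5)) // window ending at time 5 covers batches 3, 4, 5
    // A 15-second slide is not a multiple of a 10-second batch interval
    println(isValid(10, 30, 15))
  }
}
```

Consecutive windows overlap whenever the sliding interval is shorter than the window length: with the figure's values, batch 3 belongs to both the window ending at time 3 and the one ending at time 5.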

(2) Example:

Every 10 seconds, compute over the last 30 seconds of data

// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))

(3) Other functions