Structured Streaming

Spark 2.0 introduced Structured Streaming, a scalable and fault-tolerant stream processing engine built on Spark SQL. It lets real-time streaming computation be written in the same way as offline batch processing (DataFrame & SQL). As its name suggests, Structured Streaming maps both the data source and the computation result to a "structured" table, so the data stream is manipulated in a structured, table-like way, which greatly simplifies data development and improves productivity.

Before Spark 2.0, streaming computation was done with Spark Streaming:

(Figure: Spark Streaming micro-batch processing)

With Spark Streaming, each micro-batch only consumes the data of the current batch, although window operations let you consume data from a past period of time (spanning multiple batches). As a simple example, suppose you need to report the current hour's PV and UV every 10 seconds: with very large data volumes a window operation is not a good choice, and such statistics are usually maintained with the help of external stores such as Redis or HBase. A rough sketch of the window approach is shown below.
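A minimal sketch of the Spark Streaming window approach, assuming a socket source on localhost:9999 and a hypothetical checkpoint directory; it counts page views over the last hour, recomputed every 10 seconds:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowedPV")
val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches
ssc.checkpoint("/tmp/checkpoint")                     // window operations need checkpointing

val lines = ssc.socketTextStream("localhost", 9999)
// Page views over a 1-hour window, re-evaluated every 10 seconds
val pv = lines.countByWindow(Minutes(60), Seconds(10))
pv.print()

ssc.start()
ssc.awaitTermination()
```

Counting UV the same way would require keeping all distinct visitors of the window in memory, which is why external stores are usually preferred at large scale.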

 

Structured Streaming treats both the data source and the result as unbounded tables: each batch of data arriving from the source is processed and its output is appended to the result table as new rows.

(Figure: the input stream as an unbounded table feeding a continuously updated result table)

Let's try the official example first. Start NetCat locally: nc -lk 9999

Start ./spark-shell (enter the spark-shell in local mode) and run the following program:

import org.apache.spark.sql.functions._

import org.apache.spark.sql.SparkSession

// Create (or reuse) the SparkSession driving the streaming query
val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

import spark.implicits._

// Create a streaming DataFrame that reads lines from the socket source
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()

// Start the query: print the complete result table to the console after every batch
val query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()

Type "apache spark" in the NetCat session; spark-shell then displays:
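The console sink prints the complete result table for each batch; for this input the output looks roughly like the following (row order may differ):

```
-------------------------------------------
Batch: 0
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    1|
| spark|    1|
+------+-----+
```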


Then type "apache hadoop" and "hadoop spark" in the NetCat session, in two separate inputs; spark-shell displays:
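Again roughly (each input triggers a new batch, and complete mode re-emits the full cumulative counts):

```
-------------------------------------------
Batch: 1
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    2|
| spark|    1|
|hadoop|    1|
+------+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    2|
| spark|    2|
|hadoop|    2|
+------+-----+
```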


As you can see, each batch prints the complete, cumulative WordCount result so far. This is the Complete Mode of result output.


There are three modes for outputting the computation result:

  1. Complete Mode: output the latest, complete result table.
  2. Append Mode: output only the rows newly appended to the result table in this batch, i.e. essentially just this batch's data.
  3. Update Mode (not yet supported): output only the rows of the result table that were updated by this batch.

These outputs can be written directly to external storage systems through connectors (e.g. MySQL via JDBC, or the HBase API); a sketch of such a custom sink follows below.
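A minimal sketch of a JDBC sink using the ForeachWriter interface available since Spark 2.0; the MySQL URL, credentials and the word_counts table are hypothetical:

```scala
import org.apache.spark.sql.{ForeachWriter, Row}

// Illustrative only: connection details and the target table are made up.
val jdbcWriter = new ForeachWriter[Row] {
  var conn: java.sql.Connection = _

  override def open(partitionId: Long, version: Long): Boolean = {
    conn = java.sql.DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/test", "user", "password")
    true
  }

  override def process(row: Row): Unit = {
    // wordCounts rows are (value: String, count: Long)
    val stmt = conn.prepareStatement(
      "REPLACE INTO word_counts(word, cnt) VALUES (?, ?)")
    stmt.setString(1, row.getString(0))
    stmt.setLong(2, row.getLong(1))
    stmt.executeUpdate()
    stmt.close()
  }

  override def close(errorCause: Throwable): Unit = {
    if (conn != null) conn.close()
  }
}

val jdbcQuery = wordCounts.writeStream
  .outputMode("complete")
  .foreach(jdbcWriter)
  .start()
```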

You can test the remaining modes yourself.
Note: Append mode does not support aggregations over streaming data ("Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets"); a quick way to see this is sketched below.
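Continuing the word-count example above, an illustrative check of this limitation:

```scala
// Append mode works on the un-aggregated stream of words ...
val appendOk = words.writeStream
  .outputMode("append")
  .format("console")
  .start()

// ... but applying it to the aggregated wordCounts stream throws an
// AnalysisException with the message quoted above.
val appendFails = wordCounts.writeStream
  .outputMode("append")
  .format("console")
  .start()
```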


### Structured Streaming in Apache Spark: Introduction

Apache Spark offers Structured Streaming as part of its suite of big data processing capabilities. It allows developers to process streaming data using the same DataFrame and Dataset APIs used for batch processing. In essence, Structured Streaming treats streams like tables whose rows are continuously appended over time.

#### Key Features and Concepts

- **Event-time processing**: supports event-time semantics, i.e. operations can be based on timestamps carried inside the records rather than on when the records arrive at the system.
- **Fault-tolerance guarantees**: provides end-to-end exactly-once semantics by combining checkpointing with write-ahead logs (WAL).
- **Late-data handling**: handles late-arriving events gracefully through watermark policies.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

# Read from a source such as Kafka
lines = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribe", "topic_name") \
    .load()
```

This snippet demonstrates how one might set up a simple stream-reading operation with Kafka as the input source.

Beyond this basic setup, integration with the rest of the Spark stack is straightforward because the modules share the same abstractions: the same streaming DataFrames can feed machine-learning pipelines built with MLlib as well as real-time analytics jobs.
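As a sketch of the watermark mechanism mentioned above (the withWatermark API was added in Spark 2.1; the "events" DataFrame, its columns and the window sizes are hypothetical):

```scala
import org.apache.spark.sql.functions.{col, window}

// "events" is assumed to be a streaming DataFrame with columns (timestamp, word).
val windowedCounts = events
  .withWatermark("timestamp", "10 minutes")                  // tolerate events up to 10 minutes late
  .groupBy(window(col("timestamp"), "5 minutes"), col("word"))
  .count()
```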