1. Overview
A customer wants to use the row_number function for real-time stream processing.
2. The code is as follows
@Test
def rowNumberTest1(): Unit = {
  val spark = SparkSession.builder
    .appName("structured")
    .master("local[4]")
    .getOrCreate
  val host = "localhost"
  val port = "9999"
  logInfo(String.format("Listening on host: %s, port: %s", host, port))
  val map = spark.conf.getAll.mkString("\n")
  logInfo(String.format("System parameters:\n %s ", map))
  val lines = spark.readStream
    .format("socket")
    .option("host", host)
    .option("port", port)
    .load()
  // Note: the schema does not define an eventTime field yet (see section 2.1)
  val schema = new StructType()
    .add("id", StringType)
    .add("name", StringType)
    .add("age", StringType)
  val dataDF: DataFrame = lines.select(from_json(col("value").cast("string"), schema) as "data")
    .select("data.*")
  dataDF.printSchema()
  val waterMark_df = dataDF.select(to_timestamp(col("eventTime"), "yyyy-MM-dd HH:mm:ss") as 'event_time, col("*"))
    .withWatermark("event_time", "10 seconds")
  waterMark_df.printSchema()
  waterMark_df.createOrReplaceTempView("test_table")
  val sqls =
    """
      | SELECT id,name,age from ( SELECT id,name,event_time,age, row_number() OVER (PARTITION BY event_time ORDER BY age DESC) rank from test_table ) tmp_user
      |
    """.stripMargin
  println(sqls)
  val queryDF = spark.sql(sqls)
  val query = queryDF.writeStream
    .outputMode(OutputMode.Update())
    .format("console")
    .option("truncate", "false")
    .trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))
    .start
  query.awaitTermination()
}
Running it fails with the following error:
org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;;
Window [row_number() windowspecdefinition(id#4, age#6 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#10], [id#4], [age#6 DESC NULLS LAST]
+- Project [id#4, name#5, age#6]
+- SubqueryAlias `test_table`
It looks like stream processing simply does not support this kind of operation.
I have not pinned down the root cause yet; if anyone has run into the same problem, please leave me a comment.
Does Spark even have a row_number function? The test below shows that it does.
/**
 * Test point: check whether Spark has a row_number function.
 *
 * +---+----------+
 * | id|row_number|
 * +---+----------+
 * |  0|         1|
 * |  1|         2|
 * |  2|         3|
 * |  3|         4|
 * |  4|         5|
 * +---+----------+
 *
 * The output above shows that the function does exist.
 */
@Test
def rowNumberTest2(): Unit = {
  val spark = SparkSession.builder
    .appName("structured")
    .master("local[4]")
    .getOrCreate
  val df = spark.range(5)
  val windowSpec = Window.orderBy("id")
  val resultDF = df.withColumn("row_number", row_number().over(windowSpec))
  resultDF.show()
}
row_number() is used together with a window specification and assigns row numbers according to the specified ordering.
Below is an example usage of the row_number() function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = spark.range(5)
val windowSpec = Window.orderBy("id")
val resultDF = df.withColumn("row_number", row_number().over(windowSpec))
resultDF.show()
This example produces a DataFrame with two columns, "id" and "row_number", where the "row_number" column assigns each row a consecutive number, ordered by the "id" column in ascending order.
Note that row_number() must be used together with a window specification, which determines the ordering used to assign the numbers. Also be aware that a window without a PARTITION BY clause moves all data into a single partition, so row_number() can cause performance problems on large datasets and should be used with care.
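One way to mitigate that (a minimal sketch; the dept column below is invented purely for illustration) is to give the window a PARTITION BY, so each group is sorted on its own task instead of Spark shuffling every row to a single partition:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("rowNumberPartitioned").master("local[4]").getOrCreate()
import spark.implicits._

// "dept" is a made-up grouping column, used only to show partitionBy.
val df = Seq((1, "a"), (2, "a"), (3, "b"), (4, "b")).toDF("id", "dept")

// Partitioning the window keeps each group's sort local to one task and
// avoids the "moving all data to a single partition" warning.
val windowSpec = Window.partitionBy("dept").orderBy("id")
df.withColumn("row_number", row_number().over(windowSpec)).show()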
2.1 Is a field missing from the schema?
Judging from the error, could a field be missing from the schema definition?
/**
 * Test point: the customer wants to use the row_number function for real-time stream processing
* https://blog.youkuaiyun.com/qq_21383435/article/details/97633547
*
 * After adding the field to the schema, it still fails with the error below
*
* 23/07/13 16:53:57 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.
* root
* |-- id: string (nullable = true)
* |-- name: string (nullable = true)
* |-- age: string (nullable = true)
* |-- eventTime: timestamp (nullable = true)
*
* root
* |-- event_time: timestamp (nullable = true)
* |-- id: string (nullable = true)
* |-- name: string (nullable = true)
* |-- age: string (nullable = true)
* |-- eventTime: timestamp (nullable = true)
*
*
* SELECT id,name,age from ( SELECT id,name,event_time,age, row_number() OVER (PARTITION BY event_time ORDER BY age DESC) rank from test_table ) tmp_user
*
*
*
* org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;;
* Window [row_number() windowspecdefinition(event_time#12-T10000ms, age#6 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#18], [event_time#12-T10000ms], [age#6 DESC NULLS LAST]
* +- Project [id#4, name#5, event_time#12-T10000ms, age#6]
* +- SubqueryAlias `test_table`
* +- EventTimeWatermark event_time#12: timestamp, interval 10 seconds
* +- Project [to_timestamp('eventTime, Some(yyyy-MM-dd HH:mm:ss)) AS event_time#12, id#4, name#5, age#6, eventTime#7]
*
 * The error says that non-time-based windows are not supported on streaming DataFrames/Datasets.
 * Does that mean only batch processing supports this?
*/
@Test
def rowNumberTest1_1(): Unit = {
  val spark = SparkSession.builder
    .appName("structured")
    .master("local[4]")
    .getOrCreate
  val host = "localhost"
  val port = "9999"
  logInfo(String.format("Listening on host: %s, port: %s", host, port))
  val map = spark.conf.getAll.mkString("\n")
  logInfo(String.format("System parameters:\n %s ", map))
  val lines = spark.readStream
    .format("socket")
    .option("host", host)
    .option("port", port)
    .load()
  val schema = new StructType()
    .add("id", StringType)
    .add("name", StringType)
    .add("age", StringType)
    .add("eventTime", org.apache.spark.sql.types.TimestampType)
  val dataDF: DataFrame = lines.select(from_json(col("value").cast("string"), schema) as "data")
    .select("data.*")
  dataDF.printSchema()
  val waterMark_df = dataDF.select(to_timestamp(col("eventTime"), "yyyy-MM-dd HH:mm:ss") as 'event_time, col("*"))
    .withWatermark("event_time", "10 seconds")
  waterMark_df.printSchema()
  waterMark_df.createOrReplaceTempView("test_table")
  val sqls =
    """
      | SELECT id,name,age from ( SELECT id,name,event_time,age, row_number() OVER (PARTITION BY event_time ORDER BY age DESC) rank from test_table ) tmp_user
      |
    """.stripMargin
  println(sqls)
  val queryDF = spark.sql(sqls)
  val query = queryDF.writeStream
    .outputMode(OutputMode.Update())
    .format("console")
    .option("truncate", "false")
    .trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))
    .start
  query.awaitTermination()
}
You are trying to apply a non-time-based window operation to a streaming DataFrame/Dataset. In stream processing, only time-based window operations are supported; non-time-based window operations are not.
Time-based windows are defined with the window function, typically combined with a watermark, and are used to split the stream into time windows and aggregate within them. Non-time-based windows (analytic window functions such as row_number or rank over an arbitrary partition) only make sense on a static dataset and do not apply to an unbounded stream.
If you need windowed operations in Spark Structured Streaming, use the time-based window function, which covers tumbling and sliding windows, and run your aggregations over those windows.
Below is a simple example of stream processing with a time-based window function:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import spark.implicits._ // needed for the $"..." column syntax

val streamingDF = spark
  .readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true) // the socket source only exposes a "timestamp" column when this is set
  .load()

val windowedCounts = streamingDF
  .groupBy(window($"timestamp", "1 minute"))
  .count()

val query = windowedCounts.writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

query.awaitTermination()
In this example we read a stream from a socket source, split it into 1-minute windows with the window function, and count the records in each window.
Make sure to use time-based window functions when processing streaming data and avoid non-time-based window operations.
The code uses row_number(), which is a non-time-based window function, and non-time-based window functions are not supported on streaming DataFrames.
The relevant part of the error message is:
org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;;
Window [row_number() windowspecdefinition(event_time#12-T10000ms, age#6 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#18], [event_time#12-T10000ms], [age#6 DESC NULLS LAST]
The suggested fix was to switch to a window function that streaming DataFrames do support. Note, however, that rank() and the other ranking functions are also non-time-based window functions, so simply replacing row_number() with rank() does not help: either the windowing has to be time-based, or the query has to run as a batch job. A streaming-friendly alternative is sketched below.
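As one possible workaround (a minimal sketch, not the original code): if the point of the row_number() query is "the row with the highest age per event_time", a time-window aggregation with max over a struct achieves something similar and is supported in streaming. It assumes the same waterMark_df as in the test above (id/name/age as strings plus the event_time timestamp), approximates the exact PARTITION BY event_time with a 1-minute tumbling window, and casts age to int so the comparison is numeric.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode

// Pick the row with the largest age per 1-minute event-time window.
// max over a struct compares field by field, so age must come first.
val topPerWindow = waterMark_df
  .groupBy(window(col("event_time"), "1 minute"))
  .agg(max(struct(col("age").cast("int").as("age"), col("id"), col("name"))).as("top"))
  .select(col("window"), col("top.age"), col("top.id"), col("top.name"))

val query = topPerWindow.writeStream
  .outputMode(OutputMode.Update()) // with a watermark, Append would also work once each window closes
  .format("console")
  .option("truncate", "false")
  .start()
query.awaitTermination()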
2.2 Further rework
/**
 * Next I tried to rework the code further.
 * This version fails with:
 *
 * org.apache.spark.sql.AnalysisException: socket is not a valid Spark SQL Data Source.;
 */
@Test
def rowNumberTest1_2(): Unit = {
  val spark = SparkSession.builder
    .appName("structured")
    .master("local[4]")
    .getOrCreate
  val host = "localhost"
  val port = "9999"
  logInfo(String.format("Listening on host: %s, port: %s", host, port))
  val map = spark.conf.getAll.mkString("\n")
  logInfo(String.format("System parameters:\n %s ", map))
  // Read from the socket as a batch DataFrame instead of a streaming one
  val lines = spark.read
    .format("socket")
    .option("host", host)
    .option("port", port)
    .load()
  val schema = new StructType()
    .add("id", StringType)
    .add("name", StringType)
    .add("age", StringType)
    .add("eventTime", org.apache.spark.sql.types.TimestampType)
  val dataDF: DataFrame = lines.select(from_json(col("value").cast("string"), schema) as "data")
    .select("data.*")
  dataDF.printSchema()
  val waterMark_df = dataDF.select(to_timestamp(col("eventTime"), "yyyy-MM-dd HH:mm:ss") as 'event_time, col("*"))
    .withWatermark("event_time", "10 seconds")
  waterMark_df.printSchema()
  waterMark_df.createOrReplaceTempView("test_table")
  val sqls =
    """
      | SELECT id,name,age from (
      |   SELECT id,name,event_time,age,
      |          row_number() OVER (PARTITION BY event_time ORDER BY age DESC) as rank
      |   from test_table
      | ) tmp_user
    """.stripMargin
  println(sqls)
  val queryDF = spark.sql(sqls)
  queryDF.show(false) // Show the result; adjust the output as needed
  // Close the SparkSession when done
  spark.close()
}
The underlying error is still: Non-time-based windows are not supported on streaming DataFrames/Datasets;;
If you want to keep using row_number() instead of switching to rank(), you have to run the code in batch mode, because row_number() cannot be executed on a streaming DataFrame.
In the end, only the batch version below ran successfully:
/**
 * Test point: the final version, run as a batch job, that produces results
*
* root
* |-- id: string (nullable = true)
* |-- name: string (nullable = true)
* |-- age: string (nullable = true)
* |-- eventTime: string (nullable = true)
* |-- event_time: timestamp (nullable = true)
*
*
 * SELECT id, name, age, event_time,
 *        row_number() OVER (PARTITION BY event_time ORDER BY age DESC) as rank
 * FROM test_table
*
 * This run succeeded, with the following output:
*
* +---+-----+---+-------------------+----+
* |id |name |age|event_time |rank|
* +---+-----+---+-------------------+----+
* |7 |Mike |31 |2023-07-12 10:18:00|1 |
* |8 |Lisa |26 |2023-07-12 10:18:00|2 |
* |4 |Jane |32 |2023-07-12 10:16:00|1 |
* |3 |bob |28 |2023-07-12 10:16:00|2 |
*/
@Test
def rowNumberTest1_3(): Unit = {
  val spark = SparkSession.builder
    .appName("structured")
    .master("local[4]")
    .getOrCreate
  val data = Seq(
    ("1", "John", "25", "2023-07-12 10:15:00"),
    ("2", "Alice", "30", "2023-07-12 10:15:00"),
    ("3", "bob", "28", "2023-07-12 10:16:00"),
    ("4", "Jane", "32", "2023-07-12 10:16:00"),
    ("5", "Tom", "27", "2023-07-12 10:17:00"),
    ("6", "Emma", "29", "2023-07-12 10:17:00"),
    ("7", "Mike", "31", "2023-07-12 10:18:00"),
    ("8", "Lisa", "26", "2023-07-12 10:18:00")
  )
  import spark.implicits._
  val dataDF = data.toDF("id", "name", "age", "eventTime")
    .withColumn("event_time", to_timestamp($"eventTime", "yyyy-MM-dd HH:mm:ss"))
  dataDF.printSchema()
  dataDF.createOrReplaceTempView("test_table")
  val sqls =
    """
      | SELECT id, name, age, event_time,
      |        row_number() OVER (PARTITION BY event_time ORDER BY age DESC) as rank
      | FROM test_table
    """.stripMargin
  println(sqls)
  val queryDF = spark.sql(sqls)
  queryDF.show(false)
  // Close the SparkSession when done
  spark.close()
}
Final conclusion: the row_number() function cannot be executed on a streaming DataFrame.
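If the job still has to run as a stream, one compromise worth mentioning (a sketch under the assumption that the streaming waterMark_df from the earlier tests is in scope; it is not part of the original post) is foreachBatch: inside the callback each micro-batch is an ordinary static DataFrame, so row_number() is allowed there. The caveat is that the ranking is then computed only within each micro-batch, not across the whole stream.
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

val query = waterMark_df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Each micro-batch is a static DataFrame here, so non-time-based
    // window functions such as row_number() work on it.
    batchDF.createOrReplaceTempView("batch_table")
    batchDF.sparkSession.sql(
      """SELECT id, name, age
        |FROM (
        |  SELECT id, name, age,
        |         row_number() OVER (PARTITION BY event_time ORDER BY age DESC) AS rank
        |  FROM batch_table
        |) tmp_user
        |WHERE rank = 1""".stripMargin
    ).show(false)
  }
  .trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))
  .start()
query.awaitTermination()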