Reading Data
# Inspect the source file, then define the schema as a DDL string
display(dbutils.fs.ls('/mnt/training/ecommerce/events/events-2020-07-03.json'))

schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_previous_timestamp BIGINT, event_timestamp BIGINT, geo STRUCT<city: STRING, state: STRING>, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>, traffic_source STRING, user_first_touch_timestamp BIGINT, user_id STRING"
# hourly events logged from the BedBricks website on July 3, 2020
hourlyEventsPath = "/mnt/training/ecommerce/events/events-2020-07-03.json"
df = (spark.readStream
      .schema(schema)                    # streaming file sources require a predefined schema
      .option("maxFilesPerTrigger", 1)   # process one file per micro-batch
      .json(hourlyEventsPath)
     )
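A quick sanity check (not part of the original notebook): readStream produces an unbounded, streaming DataFrame.

# Aside: readStream returns a streaming DataFrame, so isStreaming is True
# and the schema matches the DDL string above.
print(df.isStreaming)  # True
df.printSchema()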
Cast to timestamp and add watermark for 2 hours
- Add column **createdAt** by dividing **event_timestamp** by 1M and casting to timestamp
- Add watermark for 2 hours
from pyspark.sql.functions import col

eventsDF = (df.withColumn("createdAt", (col("event_timestamp") / 1e6).cast("timestamp"))
            .withWatermark("createdAt", "2 hours")
           )
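The division by 1e6 is needed because event_timestamp stores microseconds since the Unix epoch; a small illustration with a made-up sample value:

from datetime import datetime, timezone

# Why divide by 1e6: event_timestamp holds microseconds since the epoch.
# sample_us is an illustrative value, not taken from the dataset.
sample_us = 1593734400000000
print(datetime.fromtimestamp(sample_us / 1e6, tz=timezone.utc))  # 2020-07-03 00:00:00+00:00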
Aggregate active users by traffic source for 1 hour windows
- Set default shuffle partitions to number of cores on your cluster (not required, but runs faster)
- Group by **traffic_source** with a 1 hour window based on the **createdAt** column
- Aggregate the approximate count of distinct users and alias with "active_users"
- Select **traffic_source**, **active_users**, and the **hour** extracted from **window.start** with alias "hour"
- Sort by **hour**
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)  # sc.defaultParallelism is shorthand for spark.sparkContext.defaultParallelism
from pyspark.sql.functions import approx_count_distinct, hour, window

trafficDF = (eventsDF.groupBy("traffic_source", window(col("createdAt"), "1 hour"))
             .agg(approx_count_distinct("user_id").alias("active_users"))
             .select(col("traffic_source"), col("active_users"), hour(col("window.start")).alias("hour"))
             .sort("hour")
            )
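Note that window() produces a struct column with start and end fields, which is why the select above reaches into window.start:

# The window() grouping yields a struct<start: timestamp, end: timestamp>
# column, so col("window.start") extracts the window's opening timestamp.
eventsDF.groupBy(window(col("createdAt"), "1 hour")).count().printSchema()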
Execute query with display() and plot results
- Execute results for **trafficDF** using display()
- Set the **streamName** parameter to set a name for the query
- Plot the streaming query results as a bar graph
- Configure the following plot options:
  - Keys: **hour**
  - Series groupings: **traffic_source**
  - Values: **active_users**
display(trafficDF, streamName="hourly_traffic_p")
This renders a live chart that keeps updating as the stream processes new files.
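display() with a streamName is Databricks-specific; outside Databricks, a rough equivalent (a sketch under that assumption) is a named memory sink you can query with SQL:

# Sketch: write the aggregation to a memory sink and query the named
# in-memory table. Complete output mode is required because the
# streaming aggregation is sorted.
query = (trafficDF.writeStream
         .format("memory")
         .queryName("hourly_traffic_p")
         .outputMode("complete")
         .start())

spark.sql("SELECT * FROM hourly_traffic_p").show()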
Manage streaming query
- Iterate over SparkSession's list of active streams to find one with name "hourly_traffic_p"
- Stop the streaming query
untilStreamIsReady("hourly_traffic_p")  # courseware helper: block until the query reports progress (sketch below)

for s in spark.streams.active:
    if s.name == "hourly_traffic_p":
        s.stop()
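untilStreamIsReady is defined by the courseware setup script; a plausible sketch of such a helper (an assumption, the real helper may differ):

import time

# Sketch of a helper like untilStreamIsReady: block until the named query
# exists and has reported a few progress updates. (An assumption; the
# actual courseware implementation may differ.)
def until_stream_is_ready(name, min_progress=3):
    while True:
        matches = [q for q in spark.streams.active if q.name == name]
        if matches and len(matches[0].recentProgress) >= min_progress:
            break
        time.sleep(5)
    print(f"The stream {name} is active and ready.")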