Reading Data
# Inspect the source file, then define the schema as a DDL string
display(dbutils.fs.ls('/mnt/training/ecommerce/events/events-2020-07-03.json'))

schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_previous_timestamp BIGINT, event_timestamp BIGINT, geo STRUCT<city: STRING, state: STRING>, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>, traffic_source STRING, user_first_touch_timestamp BIGINT, user_id STRING"
# hourly events logged from the BedBricks website on July 3, 2020
hourlyEventsPath = "/mnt/training/ecommerce/events/events-2020-07-03.json"
df = (spark.readStream
      .schema(schema)                    # streaming file sources require a predefined schema
      .option("maxFilesPerTrigger", 1)   # process one file per micro-batch
      .json(hourlyEventsPath)
     )
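A quick sanity check (not part of the original notebook): readStream produces an unbounded, streaming DataFrame.

# Aside: readStream returns a streaming DataFrame, so isStreaming is True
# and the schema matches the DDL string above.
print(df.isStreaming)  # True
df.printSchema()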
Cast to timestamp and add watermark for 2 hours
- Add column **createdAt** by dividing **event_timestamp** by 1M and casting to timestamp
- Add watermark for 2 hours
from pyspark.sql.functions import col

eventsDF = (df.withColumn("createdAt", (col("event_timestamp") / 1e6).cast("timestamp"))
            .withWatermark("createdAt", "2 hours")
           )
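The division by 1e6 is needed because event_timestamp stores microseconds since the Unix epoch; a small illustration with a made-up sample value:

from datetime import datetime, timezone

# Why divide by 1e6: event_timestamp holds microseconds since the epoch.
# sample_us is an illustrative value, not taken from the dataset.
sample_us = 1593734400000000
print(datetime.fromtimestamp(sample_us / 1e6, tz=timezone.utc))  # 2020-07-03 00:00:00+00:00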
Aggregate active users by traffic source for 1 hour windows
- Set default shuffle partitions to number of cores on your cluster (not required, but runs faster)
- Group by **traffic_source** with a 1 hour window based on the **createdAt** column
- Aggregate the approximate count of distinct users and alias with "active_users"
- Select **traffic_source**, **active_users**, and the **hour** extracted from **window.start** with alias "hour"
- Sort by **hour**
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)  # sc.defaultParallelism is shorthand for spark.sparkContext.defaultParallelism
from pyspark.sql.functions import approx_count_distinct, hour, window

trafficDF = (eventsDF.groupBy("traffic_source", window(col("createdAt"), "1 hour"))
             .agg(approx_count_distinct("user_id").alias("active_users"))
             .select(col("traffic_source"), col("active_users"), hour(col("window.start")).alias("hour"))
             .sort("hour")
            )
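Note that window() produces a struct column with start and end fields, which is why the select above reaches into window.start:

# The window() grouping yields a struct<start: timestamp, end: timestamp>
# column, so col("window.start") extracts the window's opening timestamp.
eventsDF.groupBy(window(col("createdAt"), "1 hour")).count().printSchema()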
Execute query with display() and plot results
- Execute results for **trafficDF** using display()
- Set the **streamName** parameter to set a name for the query
- Plot the streaming query results as a bar graph
- Configure the following plot options:
  - Keys: **hour**
  - Series groupings: **traffic_source**
  - Values: **active_users**
display(trafficDF, streamName="hourly_traffic_p")
This renders a live chart that keeps updating as the stream processes new files.
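display() with a streamName is Databricks-specific; outside Databricks, a rough equivalent (a sketch under that assumption) is a named memory sink you can query with SQL:

# Sketch: write the aggregation to a memory sink and query the named
# in-memory table. Complete output mode is required because the
# streaming aggregation is sorted.
query = (trafficDF.writeStream
         .format("memory")
         .queryName("hourly_traffic_p")
         .outputMode("complete")
         .start())

spark.sql("SELECT * FROM hourly_traffic_p").show()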
Manage streaming query
- Iterate over SparkSession's list of active streams to find one with name "hourly_traffic_p"
- Stop the streaming query
untilStreamIsReady("hourly_traffic_p")  # courseware helper: block until the query reports progress (sketch below)

for s in spark.streams.active:
    if s.name == "hourly_traffic_p":
        s.stop()
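untilStreamIsReady is defined by the courseware setup script; a plausible sketch of such a helper (an assumption, the real helper may differ):

import time

# Sketch of a helper like untilStreamIsReady: block until the named query
# exists and has reported a few progress updates. (An assumption; the
# actual courseware implementation may differ.)
def until_stream_is_ready(name, min_progress=3):
    while True:
        matches = [q for q in spark.streams.active if q.name == name]
        if matches and len(matches[0].recentProgress) >= min_progress:
            break
        time.sleep(5)
    print(f"The stream {name} is active and ready.")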