A first word count program with Spark Streaming (Python version)

License: CC 4.0 BY-SA

Article tags:

#spark #sparkstreaming #pyspark #python #data-stream

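This article shows how to use Spark Streaming to read a real-time data stream from a given port and compute word frequencies that are refreshed every second. The Python code creates a local StreamingContext with a batch interval of 1 second, connects to localhost:9999 to receive data, applies split, map, and reduceByKey operations to the stream, and prints the first ten elements of the counts for each batch.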

We first read data from a socket, then use Spark Streaming to count the words that were entered.

1. Open the port with the following command (if it fails, install nc first)

nc -lk 9999
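
If nc is not available, a small Python script can stand in for it. The following is a minimal sketch, not part of the original setup: the helper names open_listener and send_lines are hypothetical, and unlike nc -lk it serves a single client and then stops. It relies on the fact that socketTextStream connects as a client and reads newline-terminated text.

```python
import socket


def open_listener(host="localhost", port=9999):
    """Open a listening TCP socket (hypothetical helper, stands in for `nc -lk 9999`)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow the port to be reused right after a restart.
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines):
    """Wait for one client, send every line newline-terminated, return the number sent."""
    conn, _addr = server.accept()
    sent = 0
    with conn:
        for line in lines:
            # Normalize so that each record ends with exactly one newline.
            conn.sendall((line.rstrip("\n") + "\n").encode("utf-8"))
            sent += 1
    return sent


# Example use (blocks until the Spark job connects, then forwards typed lines):
#   import sys
#   listener = open_listener()
#   send_lines(listener, sys.stdin)
#   listener.close()
```

Start this script before the Spark job, in place of the nc window, and type lines into it as you would with nc.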

2. Write the code in sparkstreaming.py

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

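# Create a local StreamingContext with two worker threads and a batch interval of 1 second
# At least 2 threads are needed, because one of them is occupied by the receiver that reads the data
sc = SparkContext("local[2]", "NetworkWordCount")
# Group the incoming stream into batches of 1 second
ssc = StreamingContext(sc, 1)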


# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

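Every second, this code reads the data that arrived on port 9999 during that interval and computes the word count of that data. The counts are per batch: each 1-second batch is counted on its own, and totals are not carried over from earlier batches.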
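
To make the per-batch behaviour concrete, here is a minimal plain-Python sketch of what the flatMap, map, and reduceByKey steps do to the lines of one batch. It does not use Spark; the function name count_words_in_batch and the sample batches are made up for illustration.

```python
from itertools import chain


def count_words_in_batch(lines):
    """Mimic flatMap -> map -> reduceByKey for the lines of a single batch."""
    # flatMap: split every line on single spaces and flatten into one list of words
    words = list(chain.from_iterable(line.split(" ") for line in lines))
    # map: pair every word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: add up the counts of identical words
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts


# Two consecutive batches are counted independently of each other.
batch_1 = ["hello world", "hello spark"]
batch_2 = ["hello again"]
print(count_words_in_batch(batch_1))  # {'hello': 2, 'world': 1, 'spark': 1}
print(count_words_in_batch(batch_2))  # {'hello': 1, 'again': 1}
```

Note that splitting on a single space, as the streaming code does, turns consecutive spaces into empty-string "words"; calling split() without an argument would avoid that.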

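3. Run the code with spark-submit --master "local[2]" sparkstreaming.py. The master is given as local[2] here to match the value set in the code, since the receiver needs a thread of its own.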

When you type data into the window from step 1, the word counts appear in the window where Spark is running.