I. Basic Input Sources
1. File input streams
(1) spark-shell
Open a terminal (window 1) and, from any directory, start pyspark:
pyspark
# the code is the same as the standalone program in (2) below
The streaming program only watches for files added to the "…/streaming/logfile" directory after it starts; files that already exist there are not processed.
Create a new log file in logfile: open another terminal (window 2) and create a new file log1.txt in that directory (copying and pasting an existing file does not work):
vim log1.txt
The word-count results appear in terminal window 1, the window running the streaming program.
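For reference, the code entered in the pyspark shell is essentially the same as the standalone program in (2) below; a minimal sketch, with the logfile path borrowed from that program:
# typed line by line in the pyspark shell, where sc already exists
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 10)   # 10-second batch interval
lines = ssc.textFileStream('file:///home/zzp/PycharmProjects/streaming/logfile')
words = lines.flatMap(lambda line: line.split(' '))
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.pprint()
ssc.start()
ssc.awaitTermination()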
(2) Standalone program
Create a file TestStreaming.py with the code below and run it:
python3 TestStreaming.py
# TestStreaming.py
from operator import add
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
conf = SparkConf()
conf.setAppName('TestDStream')
conf.setMaster('local[2]')          # run locally with 2 threads
sc = SparkContext(conf = conf)
ssc = StreamingContext(sc, 10)      # 10-second batch interval
# watch the logfile directory for newly created files
lines = ssc.textFileStream('file:///home/zzp/PycharmProjects/streaming/logfile')
words = lines.flatMap(lambda line: line.split(' '))
wordCounts = words.map(lambda x : (x,1)).reduceByKey(add)
wordCounts.pprint()
ssc.start()
ssc.awaitTermination()
Open another terminal (window 2) and create a new file log1.txt in the logfile directory (create it with vim).
2. Socket streams
Create a new code file NetworkWordCount.py (code listed below).
Open terminal window 1 in the directory containing the file and start a listener:
sudo nc -lk 9999
Open terminal window 2 in the same directory and run the Python file:
python3 NetworkWordCount.py localhost 9999
#/usr/local/spark/mycode/streaming/socket/NetworkWordCount.py
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: NetworkWordCount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a+b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
In the first terminal window (window 1, running nc), type some words and press Enter.
The word-count results then appear in the second terminal window (window 2, the one running the Python program).
3. RDD queue streams
Create a new file RDDQueueStream.py (code listed below).
Open terminal window 1 in the directory containing the file and run the Python file:
python3 RDDQueueStream.py
#!/usr/bin/env python3
import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingQueueStream")
    ssc = StreamingContext(sc, 2)
    # Create the queue through which RDDs can be pushed to
    # a QueueInputDStream
    rddQueue = []
    for i in range(5):
        rddQueue += [ssc.sparkContext.parallelize([j for j in range(1, 1001)], 10)]
    # Create the QueueInputDStream and use it to do some processing
    inputStream = ssc.queueStream(rddQueue)
    mappedStream = inputStream.map(lambda x: (x % 10, 1))
    reducedStream = mappedStream.reduceByKey(lambda a, b: a + b)
    reducedStream.pprint()
    ssc.start()
    time.sleep(6)   # give the stream a few seconds to drain the queue before stopping
    ssc.stop(stopSparkContext=True, stopGraceFully=True)
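As a sanity check on the expected output: each queued RDD holds the numbers 1 to 1000, so every remainder key 0-9 should be reduced to a count of exactly 100 in each batch. A quick plain-Python check of that arithmetic:
from collections import Counter
# each of the ten remainder classes 0-9 occurs 100 times in 1..1000
print(Counter(j % 10 for j in range(1, 1001)))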
Problem 1: when the log file for the file input stream was created by copying and pasting an existing file, no word-count results appeared.
This is streaming data computed in real time, so the file has to be created live (e.g. with vim).
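For instance, writing a brand-new file straight into the monitored directory also gets picked up (the path is the one used by the program above; log2.txt and its contents are just an example):
cd /home/zzp/PycharmProjects/streaming/logfile
echo "hello spark hello hadoop" > log2.txt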
Problem 2:
zzp@ubuntu:$ nc -lk 9999
nc: Address already in use
First, find out which process is using port 9999:
sudo lsof -i :9999
Then use the kill command to terminate that process:
sudo kill -9 PID
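Alternatively, if the fuser utility is available, the lookup and kill can be combined into one step (an alternative to the lsof/kill sequence above):
sudo fuser -k 9999/tcp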
Restart the listener:
nc -lk 9999
When you are finished, stop the listening program, otherwise it keeps looping in its window and gets in the way of the next run; stop it with Ctrl+D or Ctrl+C.
II. Advanced Input Sources
1. Kafka source
Install Kafka.
Run (reference):
Spark2.1.0+入门:Apache Kafka作为DStream数据源(Python版)_厦大数据库实验室博客
# start ZooKeeper first and leave it running
cd /usr/local/kafka
./bin/zookeeper-server-start.sh config/zookeeper.properties
# then, in another terminal, start the Kafka server and leave it running too
cd /usr/local/kafka
./bin/kafka-server-start.sh config/server.properties
# The topic is named wordsendertest; 2181 is ZooKeeper's default port; --partitions is the number of partitions in the topic; --replication-factor is the number of replicas, which matters in a Kafka cluster, so a single-node setup needs no extra replicas.
# You can list all existing topics with --list to check that the topic created above exists (see the command after the create step).
cd /usr/local/kafka
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wordsendertest
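To check that the topic was created, list all topics (same ZooKeeper address as in the create command above):
cd /usr/local/kafka
./bin/kafka-topics.sh --list --zookeeper localhost:2181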
cd /usr/local/spark/mycode/streaming/kafka/
/usr/local/spark/bin/spark-submit ./KafkaWordCount.py localhost:2181 wordsendertest   # arguments: <zk> <topic>
## KafkaWordCount.py
from __future__ import print_function
import sys
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_wordcount.py <zk> <topic>", file=sys.stderr)
        exit(-1)
    conf = SparkConf()
    conf.setAppName('PythonStreamingKafkaWordCount')
    conf.setMaster('local[2]')
    sc = SparkContext(conf = conf)
    ssc = StreamingContext(sc, 1)
    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a+b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
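For the streaming job to have something to count, messages must be sent to the wordsendertest topic. One way is Kafka's console producer, started in yet another terminal (localhost:9092 is assumed here as the broker address, Kafka's default); type a few lines into it, for example the two below:
cd /usr/local/kafka
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wordsendertest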
hello hadoop
hello spark
2. Flume source
Install Flume (reference):
日志采集工具Flume的安装与使用方法_厦大数据库实验室博客
Run (reference):
Spark2.1.0+入门:把Flume作为DStream数据源(Python版)_厦大数据库实验室博客
cd /usr/local/spark/mycode
mkdir flume
cd flume
vim FlumeEventCount.py
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils
import pyspark
if __name__ == "__main__":
    if len(sys.argv) != 3:   # expects two arguments: <hostname> <port>, e.g. localhost 44444
        print("Usage: flume_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="FlumeEventCount")
    ssc = StreamingContext(sc, 2)
    hostname = sys.argv[1]    # hostname argument
    port = int(sys.argv[2])   # port argument
    stream = FlumeUtils.createStream(ssc, hostname, port, pyspark.StorageLevel.MEMORY_AND_DISK_SER_2)
    # prints "Receive xx Flume events!!!!" for every batch
    stream.count().map(lambda cnt: "Receive " + str(cnt) + " Flume events!!!!").pprint()
    ssc.start()
    ssc.awaitTermination()
Before submitting, make sure the Flume (and related) jars are on Spark's classpath, for example via SPARK_DIST_CLASSPATH (e.g. in /usr/local/spark/conf/spark-env.sh):
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath):$(/usr/local/hbase/bin/hbase classpath):/usr/local/spark/examples/jars/*:/usr/local/spark/jars/kafka/*:/usr/local/kafka/libs/*:/usr/local/spark/jars/flume/*:/usr/local/flume/lib/*
Then submit FlumeEventCount.py (run from /usr/local/spark, since the paths below are relative):
cd /usr/local/spark
./bin/spark-submit --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/flume/* ./mycode/flume/FlumeEventCount.py localhost 44444
At this point Flume has not been started, so no messages have been sent to FlumeEventCount and the number of Flume events shown is 0.
Now open a second terminal and start the Flume agent in it; the agent keeps listening on port 33333 of localhost, and messages can then be sent to the Flume source with "telnet localhost 33333".
cd /usr/local/flume
bin/flume-ng agent --conf ./conf --conf-file ./conf/flume-to-spark.conf --name a1 -Dflume.root.logger=INFO,console
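The contents of flume-to-spark.conf are not included in these notes; a minimal sketch that matches the behaviour described here (netcat source listening on port 33333, avro sink pushing to port 44444 where FlumeEventCount listens, joined by a memory channel) might look like the following, with the exact property values being assumptions:
# flume-to-spark.conf (sketch)
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# netcat source: receives the lines typed via telnet on localhost:33333
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 33333
# avro sink: forwards events to the port FlumeEventCount listens on
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 44444
# memory channel connecting source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1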
In another terminal, connect with telnet and type some characters followed by Enter; everything typed is picked up by Flume, gathered into the sink, and the sink forwards it to Spark's FlumeEventCount program for processing:
telnet localhost 33333
The corresponding event-count statistics then show up in the earlier terminal window, the one running FlumeEventCount.