Spark Streaming Input Sources

Table of Contents

I. Basic Input Sources

1. File input stream

(1) pyspark shell

(2) Standalone program

2. Socket stream

3. RDD queue stream

II. Advanced Input Sources

1. Kafka source

Installing Kafka

Running

2. Flume source

Installing Flume

Running


I. Basic Input Sources

1. File input stream

(1) pyspark shell

Open terminal window 1 and, from any directory, start pyspark:

pyspark
# then enter the streaming code interactively (same logic as the standalone program below)
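In the pyspark shell the SparkContext already exists as sc, so a minimal sketch of what to type interactively (same directory and 10-second batch interval as the standalone program below) is:

from operator import add
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)   # sc is predefined by the pyspark shell
lines = ssc.textFileStream('file:///home/zzp/PycharmProjects/streaming/logfile')
words = lines.flatMap(lambda line: line.split(' '))
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(add)
wordCounts.pprint()
ssc.start()
ssc.awaitTermination()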

The monitoring program only watches the "…/streaming/logfile" directory for files created after it starts; it does not process files that already existed.

Create a new log file under logfile: open another terminal window 2 and create a new log1.txt file in that directory (copying and pasting an existing file does not trigger processing):

vim log1.txt

The word-count results appear in terminal window 1, where pyspark is running.

(2) Standalone program

Create TestStreaming.py with the following content, then run it:

python3 TestStreaming.py
from operator import add
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
conf = SparkConf()
conf.setAppName('TestDStream')
conf.setMaster('local[2]')
sc = SparkContext(conf = conf)
ssc = StreamingContext(sc, 10)  # 10-second batch interval
lines = ssc.textFileStream('file:///home/zzp/PycharmProjects/streaming/logfile')  # watch this directory for new files
words = lines.flatMap(lambda line: line.split(' '))
wordCounts = words.map(lambda x : (x,1)).reduceByKey(add)
wordCounts.pprint()
ssc.start()
ssc.awaitTermination()

Open another terminal window 2 and create a new log1.txt file (with vim) in the logfile directory.

2. Socket stream

Create a new NetworkWordCount.py code file.

Open terminal window 1 in the file's directory and start the listener:

sudo nc -lk 9999

Open terminal window 2 in the same directory and run the Python file:

python3 NetworkWordCount.py localhost 9999
#/usr/local/spark/mycode/streaming/socket/NetworkWordCount.py
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: NetworkWordCount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a+b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()

Type some words into the nc listener in terminal window 1 and press Enter.

The word-count results appear in terminal window 2, where NetworkWordCount.py is running.

3. RDD queue stream

Create a new RDDQueueStream.py file.

Open terminal window 1 in the file's directory and run it:

python3 RDDQueueStream.py
#!/usr/bin/env python3
import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
 
if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingQueueStream")
    ssc = StreamingContext(sc, 2)
 
    # Create the queue through which RDDs can be pushed to
    # a QueueInputDStream
    rddQueue = []
    for i in range(5):
        rddQueue += [ssc.sparkContext.parallelize([j for j in range(1, 1001)], 10)]
 
    # Create the QueueInputDStream and use it do some processing
    inputStream = ssc.queueStream(rddQueue)
    mappedStream = inputStream.map(lambda x: (x % 10, 1))
    reducedStream = mappedStream.reduceByKey(lambda a, b: a + b)
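    # Each queued RDD holds the integers 1..1000, so every remainder 0..9 appears
    # exactly 100 times and each 2-second batch prints pairs of the form (d, 100).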
    reducedStream.pprint()
 
    ssc.start()
    time.sleep(6)   # let a few batches run before stopping
    ssc.stop(stopSparkContext=True, stopGraceFully=True)

Issue 1: creating the log file for the file input stream by copying and pasting an existing file produces no word-count output.

Streaming works on data arriving in real time, so the file must be newly created (for example with vim) after the program starts.
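For instance, a fresh file can also be created in one step with echo (the file name log2.txt and the message text are only illustrative; the directory is the one monitored by the program above):

echo "hello spark hello hadoop" > /home/zzp/PycharmProjects/streaming/logfile/log2.txt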

Issue 2: nc reports that the address is already in use:

zzp@ubuntu:$ nc -lk 9999
nc: Address already in use

First, find out which process is using port 9999:

sudo lsof -i :9999

Then terminate that process with kill (PID is the process ID reported by lsof):

sudo kill -9 PID

Restart the listener:

nc -lk 9999

Stop the listening program when you are done; otherwise it keeps listening in that window and interferes with the next run. To stop it, press Ctrl+D or Ctrl+C.

II. Advanced Input Sources

1. Kafka source

Installing Kafka

Reference: "Kafka installation and a simple test example" (Xiamen University Database Lab blog)

Running

Reference: "Spark 2.1.0+ Getting Started: Apache Kafka as a DStream source (Python version)" (Xiamen University Database Lab blog)

Start ZooKeeper first:

cd /usr/local/kafka
./bin/zookeeper-server-start.sh config/zookeeper.properties

Then, in another terminal, start the Kafka server:

cd /usr/local/kafka
./bin/kafka-server-start.sh config/server.properties

// The topic is named wordsendertest; 2181 is ZooKeeper's default port; --partitions is the number of partitions in the topic; --replication-factor is the number of replicas, used in a Kafka cluster, so no extra replicas are needed for this single-node setup
// You can use --list to show all created topics and check whether the topic above exists (see below)

cd /usr/local/kafka
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wordsendertest
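To verify that the topic was created, list the topics registered in ZooKeeper with the same kafka-topics.sh script:

cd /usr/local/kafka
./bin/kafka-topics.sh --list --zookeeper localhost:2181

After confirming that wordsendertest appears in the list, submit the streaming consumer: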
cd  /usr/local/spark/mycode/streaming/kafka/
/usr/local/spark/bin/spark-submit ./KafkaWordCount.py localhost:2181 wordsendertest #topic
## KafkaWordCount.py

from __future__ import print_function 
import sys 
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
 
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: KafkaWordCount.py <zk> <topic>", file=sys.stderr)
        exit(-1)
    conf = SparkConf()
    conf.setAppName('PythonStreamingKafkaWordCount')
    conf.setMaster('local[2]')
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)
 
    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a+b)
    counts.pprint()
 
    ssc.start()
    ssc.awaitTermination()

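The consumer above only prints word counts once messages arrive on the topic. The producer step is not shown in the original; with the standard Kafka console producer (9092 being Kafka's default broker port) it would look roughly like this:

cd /usr/local/kafka
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wordsendertest

Type a few lines into the producer window, for example: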
hello hadoop
hello spark

The word counts for these lines then appear in the terminal running KafkaWordCount.py.

2. Flume source

Installing Flume

Reference: "Installing and using the log collection tool Flume" (Xiamen University Database Lab blog)

Running

Reference: "Spark 2.1.0+ Getting Started: Flume as a DStream source (Python version)" (Xiamen University Database Lab blog)

cd /usr/local/spark/mycode
mkdir flume
cd flume
vim FlumeEventCount.py
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils
import pyspark
if __name__ == "__main__":
    if len(sys.argv) != 3:    # expects two command-line arguments, <hostname> and <port>, e.g. localhost 44444
        print("Usage: flume_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
 
    sc = SparkContext(appName="FlumeEventCount")
    ssc = StreamingContext(sc, 2)
 
    hostname = sys.argv[1]    # hostname argument
    port = int(sys.argv[2])   # port argument
    stream = FlumeUtils.createStream(ssc, hostname, port, pyspark.StorageLevel.MEMORY_AND_DISK_SER_2)
    stream.count().map(lambda cnt: "Received " + str(cnt) + " Flume events!!!!").pprint()
    # prints "Received xx Flume events!!!!" for every batch
    ssc.start()
    ssc.awaitTermination()

The script is now saved as /usr/local/spark/mycode/flume/FlumeEventCount.py.

Make sure the Flume and Kafka jars are on Spark's classpath, for example via SPARK_DIST_CLASSPATH (typically set in /usr/local/spark/conf/spark-env.sh):

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath):\
$(/usr/local/hbase/bin/hbase classpath):\
/usr/local/spark/examples/jars/*:\
/usr/local/spark/jars/kafka/*:\
/usr/local/kafka/libs/*:\
/usr/local/spark/jars/flume/*:\
/usr/local/flume/lib/*

cd /usr/local/spark
./bin/spark-submit --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/flume/* ./mycode/flume/FlumeEventCount.py localhost 44444

At this point Flume has not been started yet, so no messages have been sent to FlumeEventCount and the reported number of Flume events is 0.

Open a second terminal and start the Flume agent in it; the agent keeps listening on port 33333 of localhost, and you can then send messages to the Flume source with "telnet localhost 33333". A sketch of the agent configuration is given below, followed by the command that starts the agent.
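The contents of flume-to-spark.conf are not shown in the original; a minimal sketch consistent with the description above (a netcat source listening on port 33333 and an Avro sink pushing events to the FlumeEventCount receiver on port 44444) would be:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 33333

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 1000000

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1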

cd /usr/local/flume
bin/flume-ng agent --conf ./conf --conf-file ./conf/flume-to-spark.conf --name a1 -Dflume.root.logger=INFO,console

Type some characters and a few newlines into this window; Flume picks up these messages, gathers them at the sink, and the sink forwards them to the Spark FlumeEventCount program for processing.

telnet localhost 33333

Then, in the terminal window where FlumeEventCount is running, you will see statistics similar to the following:
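Schematically, each 2-second batch prints output of the following form (the batch time and event count depend on your run):

-------------------------------------------
Time: <batch time>
-------------------------------------------
Received <n> Flume events!!!!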
