Introduction: The official documentation for integrating Spark Streaming with Flume describes two approaches, push and pull.
Flume is one of Spark Streaming's advanced data sources (link).
Official documentation for the Spark Streaming + Flume integration (link).
If you are not familiar with Flume, here is my basic Flume tutorial (link); you are welcome to take a look.
The example code for this post is in my Gitee repository (link).
1. Overview
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Here we explain how to configure Flume and Spark Streaming so that Spark Streaming receives data from Flume. There are two approaches.
Note: As of Spark 2.3.0, Flume support is deprecated.
2. Flume-style Push-based Approach (official documentation link)
Flume is designed to push data between Flume agents. In this approach, Spark Streaming essentially sets up a receiver that acts as an Avro agent for Flume, so that Flume can push data to it.
Because this is a push model, the Spark Streaming application must be started first, with the receiver scheduled and listening on the chosen port, before Flume can start pushing data.
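On the application side, the Spark Streaming Flume integration library has to be on the classpath so that FlumeUtils is available. Below is a minimal Maven dependency sketch, reusing the same coordinates as the --packages option in the spark-submit command later in this post (the 2.3.2 / Scala 2.11 versions are assumptions taken from that command; match them to your own Spark build):

<!-- Spark Streaming Flume integration (provides FlumeUtils) -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.3.2</version>
</dependency>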
Local development and testing
1) Flume conf configuration (link). The configuration here is for local testing; how to run it on the server is covered later.
cd $FLUME_HOME/conf
vim flume-push-streaming.conf
flume-push-streaming.sources = netcat-source
flume-push-streaming.sinks = avro-sink
flume-push-streaming.channels = memory-channel
flume-push-streaming.sources.netcat-source.type = netcat
flume-push-streaming.sources.netcat-source.bind = hadoop000
flume-push-streaming.sources.netcat-source.port = 44444
flume-push-streaming.sinks.avro-sink.type = avro
flume-push-streaming.sinks.avro-sink.hostname = 192.168.31.31
flume-push-streaming.sinks.avro-sink.port = 44445
flume-push-streaming.channels.memory-channel.type = memory
flume-push-streaming.sources.netcat-source.channels = memory-channel
flume-push-streaming.sinks.avro-sink.channel = memory-channel
Note: hadoop000 is the hostname of my Linux machine, and 192.168.31.31 is the IP of my Windows machine.
2) Code (link):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePushWordCountTest {

  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: FlumePushWordCountTest <hostname> <port>")
      System.exit(1)
    }

    val Array(hostname, port) = args

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumePushWordCountTest")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    // Push approach: Spark Streaming runs an Avro receiver on <hostname>:<port> that Flume sinks to
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)

    // Each element is a SparkFlumeEvent; the payload is in event.getBody
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
3) Run the Spark Streaming application locally with the arguments 0.0.0.0 and 44445 (0.0.0.0 binds the receiver on all local interfaces), because the conf above sinks to port 44445 on my Windows machine.
4) Then run Flume on the Linux server. If you are not familiar with Flume, here is my basic Flume tutorial (link); you are welcome to take a look:
flume-ng agent \
--name flume-push-streaming \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume-push-streaming.conf \
-Dflume.root.logger=INFO,console
5) Start telnet hadoop000 44444, type some data, and check the results.
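As a rough illustration (a hypothetical session, not captured output; your timestamps and counts will differ), typing a line such as "hello world hello" into the telnet session should make the Spark Streaming console print something like:

-------------------------------------------
Time: 1541665200000 ms
-------------------------------------------
(hello,2)
(world,1)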
Server environment
1) Deploy and run in the server environment (run the jar first, then run Flume, following the same steps as the local run above).
Prerequisite: the configuration above was for local testing and sinks to my Windows IP, so the sink in the conf file must be changed:
flume-push-streaming.sinks.avro-sink.hostname = hadoop000
2) Package the application: mvn clean package -DskipTests
3) Upload the jar to the server.
4) Run the jar:
./bin/spark-submit \
--class com.imooc.spark.streaming.flume.FlumePushWordCountTest \
--name FlumePushWordCountTest \
--master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.3.2 \
/root/lib/spark-sql-1.0-jar-with-dependencies.jar \
hadoop000 44445
Since the Flume integration dependency was not bundled into the jar above, it is specified with --packages; the dependency is downloaded automatically at run time, so make sure the machine has network access, and double-check the jar path.
Tip: the maven-assembly-plugin can be used to bundle the dependencies you want into the jar.
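A minimal maven-assembly-plugin sketch (standard plugin usage, not taken from this project's pom; the jar-with-dependencies descriptor is what produces the *-jar-with-dependencies.jar name used above):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <descriptorRefs>
            <!-- produces an additional <artifact>-jar-with-dependencies.jar -->
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
    <executions>
        <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>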
3. Pull-based Approach using a Custom Sink (official documentation link)
Unlike the first approach, Flume does not push data directly to Spark Streaming.
Instead, Flume pushes data into a custom sink, where it stays buffered, and Spark Streaming uses a reliable receiver and transactions to pull data from that sink. A transaction succeeds only after the data has been received and replicated by Spark Streaming. (In other words, we fetch the data ourselves and process it.)
Compared with the push approach, this offers stronger reliability and fault tolerance.
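Note that the custom sink class used below (org.apache.spark.streaming.flume.sink.SparkSink) is not part of Flume itself, so the spark-streaming-flume-sink jar and its dependencies must be on the Flume agent's classpath. A sketch of one way to do this (the exact jar versions are assumptions; pick the ones matching your Spark and Scala versions):

# copy the custom Spark sink and its dependencies into Flume's lib directory
cp spark-streaming-flume-sink_2.11-2.3.2.jar $FLUME_HOME/lib/
cp scala-library-2.11.8.jar                  $FLUME_HOME/lib/
cp commons-lang3-3.5.jar                     $FLUME_HOME/lib/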
1) Flume conf configuration (link). The configuration here is for local testing; running it on Spark is exactly the same as with the first approach.
cd $FLUME_HOME/conf
vim flume-pull-streaming.conf
flume-pull-streaming.sources = netcat-source
flume-pull-streaming.sinks = spark-sink
flume-pull-streaming.channels = memory-channel
flume-pull-streaming.sources.netcat-source.type = netcat
flume-pull-streaming.sources.netcat-source.bind = hadoop000
flume-pull-streaming.sources.netcat-source.port = 44444
flume-pull-streaming.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
flume-pull-streaming.sinks.spark-sink.hostname = hadoop000
flume-pull-streaming.sinks.spark-sink.port = 44445
flume-pull-streaming.channels.memory-channel.type = memory
flume-pull-streaming.sources.netcat-source.channels = memory-channel
flume-pull-streaming.sinks.spark-sink.channel = memory-channel
Note: hadoop000 is the hostname of my Linux machine.
2) Code (link):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePullWordCountTest {

  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: FlumePullWordCountTest <hostname> <port>")
      System.exit(1)
    }

    val Array(hostname, port) = args

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumePullWordCountTest")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    // Pull approach: poll the Flume SparkSink running on <hostname>:<port>
    val flumeStream = FlumeUtils.createPollingStream(ssc, hostname, port.toInt)

    // Each element is a SparkFlumeEvent; the payload is in event.getBody
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
3) Start Flume first, then start the Spark Streaming application.
4) Run Flume:
flume-ng agent \
--name flume-pull-streaming \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume-pull-streaming.conf \
-Dflume.root.logger=INFO,console
5) Run the Spark Streaming application locally with the arguments 192.168.31.30 and 44445, which are my Linux machine's IP and the Spark sink port configured above.
6) Start telnet hadoop000 44444, type some data, and check the results.