Flume Notes

Flume uses the Event object as its data-transfer format; the Event is the most basic unit of internal data transport.
A Flume deployment consists of one or more agents, each of which is an independent daemon process (a JVM). An agent contains three kinds of components: source, channel, and sink. A source can fan out to multiple channels, and each sink is attached to exactly one channel (several sinks may drain the same channel); the channel is a transient staging store.
The maximum number of events a channel holds can be set via its capacity parameters.
Flume deployments usually choose the File Channel rather than the Memory Channel:
– Memory Channel: transactions are kept in memory; throughput is extremely high, but there is a risk of losing data (e.g. if the agent process dies).
– File Channel: a transactional implementation backed by local disk; guarantees that data is not lost (implemented with a WAL).
WAL: write-ahead logging.
The operations to be performed are first recorded in a log file, and only then is the data written; if the data write fails, the logged operations are replayed, which guarantees that no data is lost.
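For comparison with the memory channels used in the examples below, a minimal File Channel configuration looks like this (the checkpoint and data directory paths are illustrative, not from the original notes):

# Durable channel backed by local disk (WAL)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/flume/checkpoint
a1.channels.c1.dataDirs = /opt/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000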
• A sink removes events from the channel and places them onto an external data store.
– For example, the Flume HDFS Sink writes the data to HDFS, or passes it to the source of the next Flume agent for further processing. Only after the sink has successfully taken an Event does it remove that Event from the channel. A sink must be attached to exactly one channel.
Different types of sinks:
1. Sinks that store events at their final destination: HDFS, HBase
2. Auto-consuming sinks: Null Sink
3. Sinks for communication between agents: Avro
Developing with Flume is mostly a matter of writing conf files.
The simplest possible conf file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source: an HTTP source listening on a local port
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.bind = localhost
a1.sources.r1.port = 9000
#a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
# Stage events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
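With the agent running, the HTTP source can be tested by POSTing events to it. Assuming the default JSONHandler, which expects a JSON array of events, something like the following should make the logger sink print the event on the console:

curl -X POST -d '[{"headers":{},"body":"hello flume"}]' http://localhost:9000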
#A regex filter interceptor; interceptors are always defined on the source side
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Attach an interceptor to the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
# Filter with the interceptor: drop events whose body is all digits
a1.sources.r1.interceptors.i1.regex = ^[0-9]*$
a1.sources.r1.interceptors.i1.excludeEvents = true

# Describe the sink
#a1.sinks.k1.type = logger
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Adjust the NameNode host/port to match your cluster
a1.sinks.k1.hdfs.path = hdfs://master:9000/flume/events
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Write events to HDFS as plain text (instead of a SequenceFile)
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
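To exercise the interceptor, connect with netcat and send both kinds of lines; with excludeEvents = true, any line consisting only of digits is dropped before it reaches the channel:

nc localhost 44444
# typing "12345"   -> matched by ^[0-9]*$ and dropped
# typing "hello12" -> not an all-digit line, passes through to the HDFS sink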
#Chaining two Flume agents together
#Conf file for the first (upstream) agent:
#Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

#Describe/configure the source
a2.sources.r1.type = netcat
# Bind to the local machine
a2.sources.r1.bind = localhost
a2.sources.r1.port = 44444
a2.sources.r1.channels = c1

#Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.keep-alive = 10
a2.channels.c1.capacity = 100000
a2.channels.c1.transactionCapacity = 100000

#Describe the sink
# Forward events with an avro sink
a2.sinks.k1.type = avro
a2.sinks.k1.channel = c1
# Destination machine for the data: slave3
a2.sinks.k1.hostname = slave3
# Port on that machine
a2.sinks.k1.port = 44444

#Conf file for the second (downstream) agent
#Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
# Listen on slave3; must match the hostname targeted by the upstream avro sink
a1.sources.r1.bind = slave3
a1.sources.r1.port = 44444

#Describe the sink
# Display events in the terminal as logger output
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

#Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.keep-alive = 10
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 100000
#Start-up order: start the second (downstream) agent first, so that its avro source is already listening when the upstream avro sink tries to connect.

Command to start a Flume agent:

bin/flume-ng agent -c conf -f conf/push.conf -n a1 -Dflume.root.logger=INFO,console

Once both agents are up, check the log output of the one you started first for a message indicating that the connection succeeded.
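Note that -n must match the agent name defined inside the conf file. For the chained example above, assuming the upstream conf is saved as push.conf and the downstream conf as pull.conf (the latter filename is illustrative), the start commands would be:

# start the downstream agent (named a1) first
bin/flume-ng agent -c conf -f conf/pull.conf -n a1 -Dflume.root.logger=INFO,console
# then start the upstream agent (named a2)
bin/flume-ng agent -c conf -f conf/push.conf -n a2 -Dflume.root.logger=INFO,console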

#Receiving Flume's sink output with Kafka
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/Documents/code/python/flume_exec_test.txt

# Configure the Kafka sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# Address and port of the Kafka broker
a1.sinks.k1.brokerList = master:9092
# Kafka topic to write to
a1.sinks.k1.topic = topicName
# Serialization class
a1.sinks.k1.serializer.class = kafka.serializer.StringEncoder

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
#Giving ZooKeeper a single host:port behaves the same as listing several hosts, but if that one node's ZooKeeper process dies, Kafka's access to ZooKeeper is affected; listing every node avoids this single point of failure.
Steps for consuming Flume's output from Kafka
#Kafka requires ZooKeeper to be running first; cd into the ZooKeeper directory and run:
./bin/zkServer.sh start     # start this on each of the three nodes
#Then start a ZooKeeper client with:
./bin/zkCli.sh -server master:2181,slave1:2181,slave2:2181
#Start Kafka: cd into the Kafka directory and run:
./bin/kafka-server-start.sh config/server.properties
#Check whether the topic exists (see the topic commands below); create it if it does not, then continue
#Consume the data with a console consumer
./bin/kafka-console-consumer.sh --zookeeper master:2181,slave1:2181,slave2:2181 --topic test --from-beginning
#Start Flume with:
bin/flume-ng agent -c conf -f conf/push.conf -n a1 -Dflume.root.logger=INFO,console

#The data that arrives afterwards is displayed in the terminal running the consumer; the terminal running the Kafka broker does not show the data, only some execution logs.
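To verify the end-to-end flow, append a line to the file the exec source is tailing; it should appear in the consumer terminal:

echo "hello kafka via flume" >> /home/Documents/code/python/flume_exec_test.txt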
#Kafka topic management commands (these talk to ZooKeeper):
#Create a topic:
bin/kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 2 --topic test
#List topics
bin/kafka-topics.sh --list --zookeeper master:2181
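To inspect a topic's partition and replica assignment:

bin/kafka-topics.sh --describe --zookeeper master:2181 --topic test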
#Connecting Flume to Hive
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/Documents/code/python/flume_exec_test.txt
#a1.sources.r1.type=netcat
#a1.sources.r1.bind=master
#a1.sources.r1.port=44444

# hive sink
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://master:9083
a1.sinks.k1.hive.database = test
a1.sinks.k1.hive.table = flume_test
#a1.sinks.k1.hive.partition = eval_set
#a1.sinks.k1.hive.txnsPerBatchAsk = 2
#a1.sinks.k1.useLocalTimeStamp = false
#a1.sinks.k1.round = true
#a1.sinks.k1.roundValue = 10
#a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = ","
a1.sinks.k1.serializer.serdeSeparator = ','
a1.sinks.k1.serializer.fieldnames = order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
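The Hive sink can only write into a table that is bucketed, stored as ORC, and has transactions enabled. A sketch of a matching DDL for the table above follows; the column types are guesses from the field names, not taken from the original notes:

CREATE TABLE test.flume_test (
  order_id INT,
  user_id INT,
  eval_set STRING,
  order_number INT,
  order_dow INT,
  order_hour_of_day INT,
  days_since_prior_order FLOAT
)
CLUSTERED BY (order_id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');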