Getting Started with Flume
Flume is highly available, highly reliable, distributed software for collecting, aggregating, and transporting massive volumes of log data.
At its core, Flume collects data from a data source (source) and delivers it to a specified destination (sink). To guarantee delivery, the data is buffered (channel) before being sent to the sink; only once the data has actually reached the sink does Flume delete its buffered copy.
The Three Components of an Agent
- Source: the collection source; connects to the data source to pull data in
- Sink: the destination; passes collected data on to the next-tier agent or to the final storage system
- Channel: the data-transfer channel inside the agent; moves data from the source to the sink
What flows through the pipeline during transport is the event
- an event flows from the source into the channel, then on to the sink
- an event consists of headers (key/value metadata) and a body (the payload bytes)
A Minimal Configuration File
Listens on port 44444; test it by connecting with telnet
Start: bin/flume-ng agent --conf conf --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console
Whenever data arrives on port 44444, Flume picks it up (see the test after the config)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source component: r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe/configure the sink component: k1
a1.sinks.k1.type = logger

# Describe/configure the channel component; here the in-memory buffer is used
a1.channels.c1.type = memory
# Maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# Maximum number of events taken from the source or delivered to the sink per transaction
a1.channels.c1.transactionCapacity = 100

# Describe how source, channel, and sink are connected
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
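To test, connect from another terminal and type a line; the netcat source acknowledges each line with OK, and the logger sink prints the event on the agent console. A sketch of the interaction (the message text is illustrative, console output abbreviated):

$ telnet localhost 44444
hello flume
OK

# agent console, roughly:
# Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65 hello flume }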
Collecting a Directory into HDFS
Requirement: watch a directory on the server; whenever a new file appears there, collect it into HDFS
The directory must never receive a file whose name was already used; otherwise the source throws an error and stops working
- source: spooldir
- sink: hdfs
- channel: memory
vim spooldir-hdfs.conf
bin/flume-ng agent -c ./conf -f ./conf/spooldir-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# NOTE: never drop a file with a previously used name into the watched directory
# Source type: watch a directory
a1.sources.r1.type = spooldir
# The directory to watch
a1.sources.r1.spoolDir = /root/logs2
a1.sources.r1.fileHeader = true

# Describe the sink
# Sink type
a1.sinks.k1.type = hdfs
# Target directory in HDFS
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
# Prefix for the file names written there
a1.sinks.k1.hdfs.filePrefix = events-
# Enable timestamp rounding
# In effect this controls directory rolling: a new directory every 10 minutes (12:00, 12:10, ...)
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# File rolling: whichever of the following three conditions is met first triggers a roll
# Setting a property to 0 disables that condition and defers to the others
# Time interval: every 3 seconds the in-progress temporary file is rolled into a final file
a1.sinks.k1.hdfs.rollInterval = 3
# File size: once the file reaches 20 bytes it is rolled into a final file
a1.sinks.k1.hdfs.rollSize = 20
# Event count
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Generated file type; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
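To test, copy a file into the watched directory under a never-before-used name, then list the HDFS target; the spooldir source renames each ingested file with a .COMPLETED suffix. The source path below is illustrative:

$ cp /root/access.log /root/logs2/access-$(date +%s).log
$ hdfs dfs -ls /flume/events/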
Collecting a File into HDFS
Requirement: watch a file; whenever content is appended to it, collect the appended data into HDFS in real time
- source: exec
- sink: hdfs
- channel: memory
vim tail-hdfs.conf
bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# Source type: run a command and capture its output
a1.sources.r1.type = exec
# The command to run; tail -F keeps following the file
a1.sources.r1.command = tail -F /root/logs2/test.log
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H-%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Generated file type; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
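A simple way to drive the exec source is a loop that keeps appending to the tailed file (the loop below only generates test data):

$ while true; do echo "$(date) test line" >> /root/logs2/test.log; sleep 1; done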
Chaining Flume Agents
When data passes between two tiers of Flume agents, use avro and specify the IP and port the data is sent to
When several tiers are chained, start the tier farthest from the data source first
Agent1
vim tail-avro-avro.conf
bin/flume-ng agent -c conf -f conf/tail-avro-avro.conf -n a1
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs2/test.log
a1.sources.r1.channels = c1

# Describe the sink
# The avro sink acts as the data sender
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = node-2
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 2

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Agent2
vim avro-logger.conf
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# The avro source acts as a receiver service
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
# Bind an address reachable from Agent1; this agent runs on node-2 (Agent1's sink targets node-2)
a1.sources.r1.bind = node-2
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
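Per the startup rule above, bring up the downstream receiver first, then the sender; a sketch of the startup sequence (host placement follows the configs: the receiver on node-2, the sender wherever /root/logs2/test.log lives):

# on node-2: start the receiving agent first
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
# then start the sending agent
bin/flume-ng agent -c conf -f conf/tail-avro-avro.conf -n a1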
Flume Load Balancing (load_balance)
Load balancing addresses the situation where a single machine (a single process) cannot handle all requests by itself
Load-balancing policies: round-robin (round_robin) or random (random)
Agent1
vim exec-avro-balance.conf
Start: bin/flume-ng agent -c conf -f conf/exec-avro-balance.conf -n agent1 -Dflume.root.logger=INFO,console
# agent1 name
agent1.channels = c1
agent1.sources = r1
# two downstream sinks
agent1.sinks = k1 k2
# set group
agent1.sinkgroups = g1

# set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/logs/123.log

# set sink1: host and port of the first downstream agent
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node-2
agent1.sinks.k1.port = 52020

# set sink2: host and port of the second downstream agent
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node-3
agent1.sinks.k2.port = 52020

# set sink group
agent1.sinkgroups.g1.sinks = k1 k2

# set load_balance
agent1.sinkgroups.g1.processor.type = load_balance
# If enabled, a failed sink is temporarily blacklisted
agent1.sinkgroups.g1.processor.backoff = true
# Load-balancing policy: round-robin or random
agent1.sinkgroups.g1.processor.selector = round_robin
# Cap (ms) on the blacklist backoff; while a sink keeps failing, its backoff grows exponentially up to this cap
agent1.sinkgroups.g1.processor.selector.maxTimeOut = 10000
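To use random selection instead of round-robin, only the selector line changes:

agent1.sinkgroups.g1.processor.selector = random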
Agent2
vim avro-logger-balance.conf
Start: bin/flume-ng agent -c conf -f conf/avro-logger-balance.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node-2
a1.sources.r1.port = 52020

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Agent3
vim avro-logger-balance.conf
Start: bin/flume-ng agent -c conf -f conf/avro-logger-balance.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node-3
a1.sources.r1.port = 52020

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
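With both receivers up (start them before agent1), append lines to the tailed file and watch the two consoles; events should roughly alternate between node-2 and node-3, and backoff skips a sink while it is down. A sketch for generating traffic:

$ for i in $(seq 1 10); do echo "msg-$i" >> /root/logs/123.log; done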
Flume Failover (failover)
Mainly addresses single points of failure; the more backups, the stronger the fault tolerance
Similar to active/standby: only one downstream agent works at a time, and when the active one goes down, the next one takes over
Agent1
vim exec-avro-fail.conf
Start: bin/flume-ng agent -c conf -f conf/exec-avro-fail.conf -n agent1 -Dflume.root.logger=INFO,console
# agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
# set group
agent1.sinkgroups = g1

# set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/logs/456.log

# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node-2
agent1.sinks.k1.port = 52020

# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node-3
agent1.sinks.k2.port = 52020

# set sink group
agent1.sinkgroups.g1.sinks = k1 k2

# set failover
agent1.sinkgroups.g1.processor.type = failover
# Set priorities; the larger value wins
# If no priorities are given, sinks are ranked by the order they are declared in the configuration
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
# Maximum backoff (ms) before a failed sink is retried; if it stays down, the backoff keeps growing up to this cap
# A single successfully sent event restores the sink to active duty
agent1.sinkgroups.g1.processor.maxpenalty = 10000
Agent-k1
vim avro-logger-fail.conf
bin/flume-ng agent -c conf -f conf/avro-logger-fail.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node-2
a1.sources.r1.port = 52020

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Agent-k2
vim avro-logger-fail.conf
bin/flume-ng agent -c conf -f conf/avro-logger-fail.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node-3
a1.sources.r1.port = 52020

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
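To watch failover happen, keep a generator running against the tailed file, then kill and restart the higher-priority receiver (the generator below only produces test data):

# generate traffic
$ while true; do echo "$(date) failover test" >> /root/logs/456.log; sleep 1; done
# events appear on node-2 (priority 10); Ctrl+C that agent and they move to node-3;
# restart node-2's agent, and after one successfully sent event it takes over again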
Flume Interceptors
Requirement: two log servers, A and B, produce logs in real time, mainly access.log, nginx.log, and web.log
Collect access.log, nginx.log, and web.log from machines A and B, aggregate them on machine C, and then write everything into HDFS together.
With interceptors in place, the setup looks as follows: custom key/value pairs can be inserted into the event headers
Server A Agent
vim exec_source_avro_sink.conf
Start: bin/flume-ng agent -c conf -f conf/exec_source_avro_sink.conf --name a1 -Dflume.root.logger=DEBUG,console
# Name the components on this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

# Describe/configure the sources
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs1/access.log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access

a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /root/logs1/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx

a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /root/logs1/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node-2
a1.sinks.k1.port = 41414

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the sources and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
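Each static interceptor stamps a fixed key/value pair into the headers of every event its source emits; an event tailed from access.log would look roughly like this in logger-sink notation (body abbreviated):

Event: { headers:{type=access} body: ... }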
Server C Agent
vim avro_source_hdfs_sink.conf
Start: bin/flume-ng agent -c conf -f conf/avro_source_hdfs_sink.conf --name a1 -Dflume.root.logger=DEBUG,console
# Define the agent's source, channel, and sink names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node-2
a1.sources.r1.port = 41414
# Add a timestamp interceptor so the %Y%m%d escape in the sink path can resolve
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 1000

# Define the sink
a1.sinks.k1.type = hdfs
# %{type} expands to the header set upstream: access / nginx / web
a1.sinks.k1.hdfs.path = hdfs://node-1:9000/source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Timestamp source; not needed here because the timestamp interceptor supplies the header
#a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 1
# Number of threads Flume uses for HDFS operations (create, write, etc.)
a1.sinks.k1.hdfs.threadsPoolSize = 10
# Timeout (ms) for HDFS operations
a1.sinks.k1.hdfs.callTimeout = 30000

# Wire up source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
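Once both agents are running, the logs land in HDFS partitioned by the type header and by date; an illustrative layout (the date directory and file names are examples):

$ hdfs dfs -ls -R /source/logs
/source/logs/access/20230601/events.1685577600000
/source/logs/nginx/20230601/events.1685577600001
/source/logs/web/20230601/events.1685577600002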