Getting Started with Flume
Flume is highly available, highly reliable, distributed software for collecting, aggregating, and transporting massive volumes of log data.
At its core, Flume collects data from a data source (source) and delivers it to a specified destination (sink). To guarantee delivery, the data is buffered (channel) before being sent to the sink; only once the data has actually reached the sink does Flume delete its buffered copy.
The Three Components of an Agent
- Source: the collection source; connects to the data source to pull data in
- Sink: the destination; passes collected data on to the next-tier agent or to the final storage system
- Channel: the data-transfer channel inside the agent; moves data from the source to the sink
What flows through the pipeline during transport is the event
- an event flows from the source into the channel, then on to the sink
- an event consists of headers (key/value metadata) and a body (the payload bytes)
A Minimal Configuration File
Listens on port 44444; test it by connecting with telnet
Start: bin/flume-ng agent --conf conf --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console
Whenever data arrives on port 44444, Flume picks it up (see the test after the config)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source component: r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe/configure the sink component: k1
a1.sinks.k1.type = logger

# Describe/configure the channel component; here the in-memory buffer is used
a1.channels.c1.type = memory
# Maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# Maximum number of events taken from the source or delivered to the sink per transaction
a1.channels.c1.transactionCapacity = 100

# Describe how source, channel, and sink are connected
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
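To test, connect from another terminal and type a line; the netcat source acknowledges each line with OK, and the logger sink prints the event on the agent console. A sketch of the interaction (the message text is illustrative, console output abbreviated):

$ telnet localhost 44444
hello flume
OK

# agent console, roughly:
# Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65 hello flume }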
Collecting a Directory into HDFS
Requirement: watch a directory on the server; whenever a new file appears there, collect it into HDFS
The directory must never receive a file whose name was already used; otherwise the source throws an error and stops working
- source: spooldir
- sink: hdfs
- channel: memory
vim spooldir-hdfs.conf
bin/flume-ng agent -c ./conf -f ./conf/spooldir-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# NOTE: never drop a file with a previously used name into the watched directory
# Source type: watch a directory
a1.sources.r1.type = spooldir
# The directory to watch
a1.sources.r1.spoolDir = /root/logs2
a1.sources.r1.fileHeader = true

# Describe the sink
# Sink type
a1.sinks.k1.type = hdfs
# Target directory in HDFS
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
# Prefix for the file names written there
a1.sinks.k1.hdfs.filePrefix = events-
# Enable timestamp rounding
# In effect this controls directory rolling: a new directory every 10 minutes (12:00, 12:10, ...)
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# File rolling: whichever of the following three conditions is met first triggers a roll
# Setting a property to 0 disables that condition and defers to the others
# Time interval: every 3 seconds the in-progress temporary file is rolled into a final file
a1.sinks.k1.hdfs.rollInterval = 3
# File size: once the file reaches 20 bytes it is rolled into a final file
a1.sinks.k1.hdfs.rollSize = 20
# Event count
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Generated file type; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
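To test, copy a file into the watched directory under a never-before-used name, then list the HDFS target; the spooldir source renames each ingested file with a .COMPLETED suffix. The source path below is illustrative:

$ cp /root/access.log /root/logs2/access-$(date +%s).log
$ hdfs dfs -ls /flume/events/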
Collecting a File into HDFS
Requirement: watch a file; whenever content is appended to it, collect the appended data into HDFS in real time
- source: exec
- sink: hdfs
- channel: memory
vim tail-hdfs.conf
bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# Source type: run a command and capture its output
a1.sources.r1.type = exec
# The command to run; tail -F keeps following the file
a1.sources.r1.command = tail -F /root/logs2/test.log
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H-%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Generated file type; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
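A simple way to drive the exec source is a loop that keeps appending to the tailed file (the loop below only generates test data):

$ while true; do echo "$(date) test line" >> /root/logs2/test.log; sleep 1; done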
Chaining Flume Agents
When data passes between two tiers of Flume agents, use avro and specify the IP and port the data is sent to
When several tiers are chained, start the tier farthest from the data source first
Agent1
vim tail-avro-avro.conf
bin/flume-ng agent -c conf -f conf/tail-avro-avro.conf -n a1
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs2/test.log
a1.sources.r1.channels = c1

# Describe the sink
# The avro sink acts as the data sender
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = node-2
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 2

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Agent2
vim avro-logger.conf
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# The avro source acts as a receiver service
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
# Bind an address reachable from Agent1; this agent runs on node-2 (Agent1's sink targets node-2)
a1.sources.r1.bind = node-2
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
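Per the startup rule above, bring up the downstream receiver first, then the sender; a sketch of the startup sequence (host placement follows the configs: the receiver on node-2, the sender wherever /root/logs2/test.log lives):

# on node-2: start the receiving agent first
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
# then start the sending agent
bin/flume-ng agent -c conf -f conf/tail-avro-avro.conf -n a1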
Flume Load Balancing (load_balance)
Load balancing addresses the situation where a single machine (a single process) cannot handle all requests by itself
Load-balancing policies: round-robin (round_robin) or random (random)
Agent1
vim exec-avro-balance.conf
Start: bin/flume-ng agent -c conf -f conf/exec-avro-balance.conf -n agent1 -Dflume.root.logger=INFO,console
# agent1 name
agent1.channels = c1
agent1.sources = r1
# two downstream sinks
agent1.sinks = k1 k2
# set group
agent1.sinkgroups = g1

# set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/logs/123.log

# set sink1: host and port of the first downstream agent
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node-2
agent1.sinks.k1.port = 52020

# set sink2: host and port of the second downstream agent
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node-3
agent1.sinks.k2.port = 52020

# set sink group
agent1.sinkgroups.g1.sinks = k1 k2

# set load_balance
agent1.sinkgroups.g1.processor.type = load_balance
# If enabled, a failed sink is temporarily blacklisted
agent1.sinkgroups.g1.processor.backoff = true
# Load-balancing policy: round-robin or random
agent1.sinkgroups.g1.processor.selector = round_robin
# Cap (ms) on the blacklist backoff; while a sink keeps failing, its backoff grows exponentially up to this cap
agent1.sinkgroups.g1.processor.selector.maxTimeOut = 10000
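To use random selection instead of round-robin, only the selector line changes:

agent1.sinkgroups.g1.processor.selector = random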
Agent2
vim avro-logger-balance.conf
Start: bin/flume-ng agent -c conf -f conf/avro-logger-balance.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node-2
a1.sources.r1.port = 52020

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Agent3
vim avro-logger-balance.conf
Start: bin/flume-ng agent -c conf -f conf/avro-logger-balance.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node-3
a1.sources.r1.port = 52020

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
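With both receivers up (start them before agent1), append lines to the tailed file and watch the two consoles; events should roughly alternate between node-2 and node-3, and backoff skips a sink while it is down. A sketch for generating traffic:

$ for i in $(seq 1 10); do echo "msg-$i" >> /root/logs/123.log; done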
Flume Failover (failover)
Mainly addresses single points of failure; the more backups, the stronger the fault tolerance
Similar to active/standby: only one downstream agent works at a time, and when the active one goes down, the next one takes over
Agent1
vim exec-avro-fail.conf
Start: bin/flume-ng agent -c conf -f conf/exec-avro-fail.conf -n agent1 -Dflume.root.logger=INFO,console
# agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
# set group
agent1.sinkgroups = g1

# set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/logs/456.log

# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node-2
agent1.sinks.k1.port = 52020

# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node-3
agent1.sinks.k2.port = 52020

# set sink group
agent1.sinkgroups.g1.sinks = k1 k2

# set failover
agent1.sinkgroups.g1.processor.type = failover
# Set priorities; the larger value wins
# If no priorities are given, sinks are ranked by the order they are declared in the configuration
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
# Maximum backoff (ms) before a failed sink is retried; if it stays down, the backoff keeps growing up to this cap
# A single successfully sent event restores the sink to active duty
agent1.sinkgroups.g1.processor.maxpenalty = 10000
Agent-k1
vim avro-logger-fail.conf
bin/flume-ng agent -c conf -f conf/avro-logger-fail.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node-2
a1.sources.r1.port = 52020

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Agent-k2
vim avro-logger-fail.conf
bin/flume-ng agent -c conf -f conf/avro-logger-fail.conf -n a1 -Dflume.root.logger=INFO,console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node-3
a1.sources.r1.port = 52020

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
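To watch failover happen, keep a generator running against the tailed file, then kill and restart the higher-priority receiver (the generator below only produces test data):

# generate traffic
$ while true; do echo "$(date) failover test" >> /root/logs/456.log; sleep 1; done
# events appear on node-2 (priority 10); Ctrl+C that agent and they move to node-3;
# restart node-2's agent, and after one successfully sent event it takes over again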
Flume Interceptors
Requirement: two log servers, A and B, produce logs in real time, mainly access.log, nginx.log, and web.log
Collect access.log, nginx.log, and web.log from machines A and B, aggregate them on machine C, and then write everything into HDFS together.
With interceptors in place, the setup looks as follows: custom key/value pairs can be inserted into the event headers
Server A Agent
vim exec_source_avro_sink.conf
Start: bin/flume-ng agent -c conf -f conf/exec_source_avro_sink.conf --name a1 -Dflume.root.logger=DEBUG,console
# Name the components on this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

# Describe/configure the sources
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs1/access.log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access

a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /root/logs1/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx

a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /root/logs1/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node-2
a1.sinks.k1.port = 41414

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the sources and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
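Each static interceptor stamps a fixed key/value pair into the headers of every event its source emits; an event tailed from access.log would look roughly like this in logger-sink notation (body abbreviated):

Event: { headers:{type=access} body: ... }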
Server C Agent
vim avro_source_hdfs_sink.conf
Start: bin/flume-ng agent -c conf -f conf/avro_source_hdfs_sink.conf --name a1 -Dflume.root.logger=DEBUG,console
# Define the agent's source, channel, and sink names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node-2
a1.sources.r1.port = 41414
# Add a timestamp interceptor so the %Y%m%d escape in the sink path can resolve
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 1000

# Define the sink
a1.sinks.k1.type = hdfs
# %{type} expands to the header set upstream: access / nginx / web
a1.sinks.k1.hdfs.path = hdfs://node-1:9000/source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Timestamp source; not needed here because the timestamp interceptor supplies the header
#a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 1
# Number of threads Flume uses for HDFS operations (create, write, etc.)
a1.sinks.k1.hdfs.threadsPoolSize = 10
# Timeout (ms) for HDFS operations
a1.sinks.k1.hdfs.callTimeout = 30000

# Wire up source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
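Once both agents are running, the logs land in HDFS partitioned by the type header and by date; an illustrative layout (the date directory and file names are examples):

$ hdfs dfs -ls -R /source/logs
/source/logs/access/20230601/events.1685577600000
/source/logs/nginx/20230601/events.1685577600001
/source/logs/web/20230601/events.1685577600002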