Flume Basics: Getting Started

Original article · latest recommended article published 2024-10-13 14:12:46 · 341 reads

License: CC 4.0 BY-SA

Article tags: #flume基础 (Flume basics)

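This article describes Flume's log collection mechanism in detail: source types including netcat, exec, spool dir, Avro, Kafka, and taildir; channel types including memory, file, and Kafka; and sink types including HDFS and Avro. It covers the characteristics and configuration parameters of each type.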

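1. Introduction to Flume

Flume is a log collection tool. Data moves through it as events: each event carries a body, sent as a byte array, and can also carry headers. To guarantee delivery, every Flume agent runs two transactions: one between the source and the channel, and one between the channel and the sink.
Data in transit is buffered before it reaches its destination, and it is deleted from the buffer only after it has been delivered successfully.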
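
The source, channel, and sink wiring described above can be sketched as a minimal agent. This is an illustrative sketch rather than a configuration from the original article: the agent name a1 and the component names s1, c1, and k1 are arbitrary labels, and the built-in seq source is used only because it needs no external input.

```properties
# Name the components of agent a1 (names are arbitrary labels)
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Source: built-in sequence generator, emits a counter as events
a1.sources.s1.type = seq
# A source may feed several channels, hence the plural "channels"
a1.sources.s1.channels = c1

# Channel: buffers events between the two transactions
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
# Upper bound on events per put (source side) or take (sink side) transaction
a1.channels.c1.transactionCapacity = 100

# Sink: logs each event; a sink reads from exactly one channel, hence "channel"
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Assuming the file is saved as example.conf (a hypothetical name), an agent like this is typically started with `bin/flume-ng agent --conf conf --conf-file example.conf --name a1`.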

2. Source types

netcat source

How it works

Starts a socket server that listens on a port, and turns the data received on that port into events written to the channel.

Configuration

a1.sources = s1
a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 44444
a1.sources.s1.channels = c1
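
The fragment above defines only the source. To run it as a standalone agent, it also needs the channel c1 it refers to and a sink. A minimal sketch of the missing pieces follows, assuming the same memory channel and logger sink used in the other examples of this article; the two optional netcat settings are shown with their documented default values.

```properties
# Optional netcat source settings (values shown are the documented defaults)
a1.sources.s1.max-line-length = 512
a1.sources.s1.ack-every-event = true

# Channel and sink so the agent can run on its own
a1.channels = c1
a1.channels.c1.type = memory

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

With the agent running, connecting with a tool such as `nc localhost 44444` and typing lines should produce one logged event per line.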

exec source

How it works

Starts a Linux shell command specified by the user;
collects the output of that command as the input data and turns it into events written to the channel.

Configuration

a1.sources = s1
a1.sources.s1.channels = c1
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/weblog/access.log
a1.sources.s1.batchSize = 100

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

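Drawback

The exec source does not record the offset it had reached, so after a crash the read position is lost and data can easily be lost.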
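
The exec source does have settings for restarting a command that has exited. They do not solve the missing-offset problem described above, but they keep collection going if the command itself dies. A sketch, assuming the source name s1 from the configuration above:

```properties
# Restart the command if it exits (default: false)
a1.sources.s1.restart = true
# Wait 10 s before attempting a restart (milliseconds; the documented default)
a1.sources.s1.restartThrottle = 10000
# Also log the command's stderr output (default: false)
a1.sources.s1.logStdErr = true
```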

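spool dir source

How it works

Watches a specified directory; when the directory contains new files that have not yet been collected, it reads their data and turns it into events written to the channel.
Note: files in the spooling directory must be immutable once placed there, and file names must never be reused. Otherwise the source will fail loudly.

Configuration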

a1.sources = s1
a1.sources.s1.channels = c1
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /root/weblog
# Keep batchSize at or below the channel's transactionCapacity (100 below)
a1.sources.s1.batchSize = 100

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

After a crash, the spooling directory source still has a record of the offset it had reached, so it is a safe source for collection.
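
How the source treats finished and unfinished files can be tuned with a few more settings. The following is a sketch, assuming the source name s1 from the configuration above; the .tmp naming convention for files still being written is an assumption made for illustration.

```properties
# Suffix appended to a file once it has been fully ingested (default shown)
a1.sources.s1.fileSuffix = .COMPLETED
# never = keep completed files; immediate = delete them after ingestion
a1.sources.s1.deletePolicy = never
# Skip files that are still being written (the .tmp convention is an assumption)
a1.sources.s1.ignorePattern = ^.*[.]tmp$
# Add the absolute path of the originating file to each event's headers
a1.sources.s1.fileHeader = true
```

Writing a file under a temporary name and renaming it into place once it is complete is one way to respect the immutability rule.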

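avro source

How it works

Starts a service that listens on a port and receives the Avro-serialized data stream arriving on it.
The source contains an Avro deserializer, so it can correctly deserialize the incoming binary stream, wrap the result in events, and write them to the channel.

Configuration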

a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141


a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

kafka source

How it works

The Kafka source uses a Kafka consumer to connect to Kafka, reads records, converts them into events, and writes them to the channel.

Configuration

a1.sources = s1
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.channels = c1
a1.sources.s1.batchSize = 100
a1.sources.s1.batchDurationMillis = 2000
a1.sources.s1.kafka.bootstrap.servers = c701:9092,c702:9092,c703:9092
a1.sources.s1.kafka.topics = TAOGE
a1.sources.s1.kafka.consumer.group.id = g1

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
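
Two further Kafka source settings are often useful. This is a sketch under stated assumptions: the topic pattern is a made-up example, and the source name s1 is taken from the configuration above.

```properties
# Subscribe by regular expression instead of a fixed topic list;
# this overrides kafka.topics when both are set. The pattern is a made-up example.
a1.sources.s1.kafka.topics.regex = ^weblog-.*$
# Any kafka.consumer.* property is passed through to the Kafka consumer.
# earliest = start from the oldest retained record when the group has no committed offset
a1.sources.s1.kafka.consumer.auto.offset.reset = earliest
```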

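taildir source

How it works

Watches a set of files under the specified directories; whenever new lines are written to any of them, those lines are tailed.
It records the position reached in each file in a designated position file, in JSON format (if needed, the file can be edited by hand so that the source starts reading from any chosen position).
It never modifies files it has finished collecting (no renaming, no deletion, and so on).

Configuration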

a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /root/flumedata/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /root/weblog/access.log
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200
a1.channels.c1.transactionCapacity = 100

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

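Note

taildir is a reliable source, because it can resume from the recorded positions after a restart; it is generally the preferred choice for collecting data.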
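
A single taildir source can watch several groups of files. The sketch below assumes the source name r1 from the configuration above; the second path and the header key logtype are hypothetical examples.

```properties
a1.sources.r1.filegroups = f1 f2
# f1: one specific file
a1.sources.r1.filegroups.f1 = /root/weblog/access.log
# f2: every file in the directory whose name matches the regex
# (only the file name part may be a regex; the path is hypothetical)
a1.sources.r1.filegroups.f2 = /root/applog/.*log.*
# Attach a fixed header to the events of each group (key and values are examples)
a1.sources.r1.headers.f1.logtype = access
a1.sources.r1.headers.f2.logtype = app
```

The per-group headers let a downstream component tell the two kinds of log apart.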

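3. Channels

memory channel

Characteristics

Events are stored in an in-memory queue whose capacity is configured in advance.
It is fast but less reliable: events still in the queue can be lost, for example if the agent process dies.

Configuration parameters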

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
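
The two byte-related parameters interact. The annotated fragment below works through the arithmetic for the values above; it reflects my reading of the memory channel documentation, in which only event bodies are counted and the buffer percentage is set aside for headers.

```properties
# Total memory budget for the channel's events, in bytes
a1.channels.c1.byteCapacity = 800000
# Reserve 20% of that budget to account for event headers
a1.channels.c1.byteCapacityBufferPercentage = 20
# Worked example: 800000 * (1 - 20/100) = 640000 bytes usable for event bodies.
# The channel is full when either limit is reached first:
#   capacity (10000 events) or the 640000-byte body budget.
```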

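file channel

Characteristics

Events are buffered in files on local disk.
Reliability is high: buffered events survive an agent restart and are not lost.
In extreme cases, however, events may be duplicated.

Configuration parameters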

a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /root/taildir_chkp/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /root/weblog/access.log
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000


a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 100
a1.channels.c1.checkpointDir = /root/flume_chkp
a1.channels.c1.dataDirs = /root/flume_data

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
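
The file channel can spread its data over several disks and keep a backup of its checkpoint. A sketch, assuming the channel name c1 from the configuration above; all directory paths here are hypothetical.

```properties
# Several data directories, ideally on separate disks (paths are hypothetical)
a1.channels.c1.dataDirs = /data1/flume_data,/data2/flume_data
# Keep a backup copy of the checkpoint; requires backupCheckpointDir to be set
a1.channels.c1.useDualCheckpoints = true
# Must differ from both the checkpoint directory and the data directories
a1.channels.c1.backupCheckpointDir = /root/flume_chkp_backup
```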

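kafka channel

Characteristics

The agent uses Kafka as the channel's data buffer.
The Kafka channel should not be confused with the Kafka source or the Kafka sink.
An agent that uses a Kafka channel can run without a source, or without a sink.

Configuration parameters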

a1.sources = r1
# Two channels are configured to make the behavior easy to observe
# c1 is a Kafka channel, c2 is a memory channel
a1.channels = c1 c2
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/a.log
a1.sources.r1.channels = c1 c2

# Kafka channel configuration; this channel has no sink
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = h1:9092,h2:9092,h3:9092
a1.channels.c1.parseAsFlumeEvent = false

# Memory channel configuration, connected to a logger sink for observation
a1.channels.c2.type = memory
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c2
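
The opposite arrangement, a Kafka channel with a sink but no source, can be sketched as follows. It assumes that some other producer writes plain records into the topic; the topic name weblog and the consumer group flume-g1 are hypothetical.

```properties
a1.channels = c1
a1.sinks = k1

# No source: other producers write plain records into the topic
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = h1:9092,h2:9092,h3:9092
# Topic name is an assumption; the documented default is flume-channel
a1.channels.c1.kafka.topic = weblog
a1.channels.c1.kafka.consumer.group.id = flume-g1
# Records are raw payloads, not Avro-wrapped Flume events
a1.channels.c1.parseAsFlumeEvent = false

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```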

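4. Sinks

hdfs sink

Characteristics

Data is ultimately delivered to HDFS.
It can produce text files or sequence files, and it supports compression.
It supports periodic rolling of output files, based on file size, time interval, or event count.
The target path can contain escape sequences that are replaced dynamically, for example %Y-%m-%d for the date taken from the event's timestamp.

Configuration parameters

## Definitions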
a1.sources = r1
a1.sinks = k1
a1.channels = c1

## Source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/a.log
a1.sources.r1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

## Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000000

## Sink
a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://h1:8020/doitedu/%Y-%m-%d/%H-%M
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

a1.sinks.k1.hdfs.filePrefix = doit_
a1.sinks.k1.hdfs.fileSuffix = .log.gz

a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 102400
a1.sinks.k1.hdfs.rollCount = 0

a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
a1.sinks.k1.hdfs.writeFormat = Text
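
The configuration above rolls files by size only. A time-based variant is sketched below, together with a worked example of how path rounding behaves. It assumes the sink name k1 and the path pattern from the configuration above, and swaps the timestamp interceptor for the sink's own clock.

```properties
# Use the agent's clock for %Y-%m-%d/%H-%M instead of a timestamp header,
# so the timestamp interceptor is not required
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Round the timestamp down to a multiple of 10 minutes before expanding the path:
# an event at 14:37 lands in the directory ending in 14-30
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

# Roll on time only: close the current file every 600 s,
# with size- and count-based rolling disabled (0 = never)
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
```

Matching the roll interval to the 10-minute rounding keeps roughly one file per directory, which helps avoid many small files on HDFS.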

avro sink

Characteristics

The avro sink sends Avro-serialized data to an avro source, which makes it possible to chain agents together.

Configuration parameters

a1.sources = s1
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/weblog/access.log
a1.sources.s1.channels = c1

a1.channels = c1
a1.channels.c1.type = memory

a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = c701
a1.sinks.k1.port = 4545
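
For the chain to work, a second agent must be listening on the host and port that the avro sink targets. A minimal sketch of such a downstream agent follows; the agent name a2 and its component names are illustrative, and the logger sink stands in for whatever final destination is needed.

```properties
# Downstream agent a2, running on host c701 (names are illustrative)
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# Avro source listening on the port the upstream avro sink sends to
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1

a2.channels.c1.type = memory

a2.sinks.k1.type = logger
a2.sinks.k1.channel = c1
```

Starting the downstream agent first gives the upstream sink something to connect to as soon as it comes up.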