1 Multiple applications connecting to the same Flume Agent
Following the description in the logback–Flume integration chapter, it is easy to have every system share the same logback and Flume configuration files and, with little effort, write their log data into HDFS.
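For context, the shared agent can be as simple as the sketch below: an avro source fed by the applications' logback appenders, a memory channel, and an HDFS sink. The agent name, port, and HDFS path are illustrative placeholders, not the original configuration.

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# avro source that the applications' appenders send events to
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink; useLocalTimeStamp lets the %Y%m%d escapes resolve
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs/%Y%m%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1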

The drawback of this design is just as obvious: the single agent is a single point of failure.
2 Flume Agent cluster
The obvious next step is to put a proxy in front of several agents; here we use haproxy.

The haproxy configuration is fairly simple:
frontend flume_hdfs_front
    bind *:9095
    mode tcp
    log global
    option tcplog
    timeout client 3600s
    backlog 4096
    maxconn 1000000
    default_backend flume_hdfs_back

backend flume_hdfs_back
    mode tcp
    option log-health-checks
    option redispatch
    option tcplog
    balance roundrobin
    timeout connect 1s
    timeout queue 5s
    timeout server 3600s
    server f1 192.168.5.174:44444 check inter 2000 rise 3 fall 3 weight 1
    server f2 192.168.5.173:44444 check inter 2000 rise 3 fall 3 weight 1
I set the file roll size to 10 MB. When I stopped one of the machines (174), I found that each Flume agent generates its own files rather than appending to the file the other agent had opened.

When I brought 174 back up and stopped 173, yet another new file was produced.
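This is expected: every HDFS sink opens and names its own files (a prefix plus a counter) and never appends to a file another agent created. The 10 MB behavior above comes from roll settings along these lines; the sink name k1 is a placeholder:

# roll a new file after roughly 10 MB, and disable time- and count-based rolling
a1.sinks.k1.hdfs.rollSize = 10485760
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0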

In scheme 2, where the Flume agents connect directly to HDFS, the HDFS NameNode comes under enormous pressure once many agents write to HDFS at the same time.
3 Flume writing to different databases per business line

Here Flume again acts as the front node for HBase. Once you have explored Flume to this depth there is no need to reinvent the wheel, and it is unwise to repeat the stumbles of those who walked the path before; see the earlier posts in this series, Flume high-concurrency optimization — (2) streamlining the topology, and Flume high-concurrency optimization — (4) Kafka channel.
Our company's Flume topology still needs further streamlining.
4 Using the Kafka channel and writing to HBase

In this layout the source fans each event out into the two Kafka channels. When multiple sinks consume the same channel, an event handed over by the source is executed by only one of them; once that sink succeeds, the other sinks do not execute it again.
# read from kafka and write to hbase
dzm-agent.sources = dzm-source
dzm-agent.channels = dzm-channel dzm-channel-detail
dzm-agent.sinks = dzm-sink dzm-sink-detail
# source
dzm-agent.sources.dzm-source.type = avro
dzm-agent.sources.dzm-source.bind = 0.0.0.0
dzm-agent.sources.dzm-source.port = 44443
#dzm-agent.sources.dzm-source.selector.type = replicating
# channel
dzm-agent.channels.dzm-channel.type = org.apache.flume.channel.kafka.KafkaChannel
dzm-agent.channels.dzm-channel.kafka.bootstrap.servers = ceshi185:19092,ceshi186:19092,ceshi185:19092
dzm-agent.channels.dzm-channel.kafka.topic = flume_dzm_channel
dzm-agent.channels.dzm-channel.kafka.consumer.group.id = flume_dzm_channel
# sink
dzm-agent.sinks.dzm-sink.type = asynchbase
dzm-agent.sinks.dzm-sink.table = t_invoice_ticket
dzm-agent.sinks.dzm-sink.columnFamily = i
dzm-agent.sinks.dzm-sink.serializer = com.bwjf.flume.invoice.dzm.sink.InvoiceHbaseSerializer
# detail-channel
dzm-agent.channels.dzm-channel-detail.type = org.apache.flume.channel.kafka.KafkaChannel
dzm-agent.channels.dzm-channel-detail.kafka.bootstrap.servers = ceshi185:19092,ceshi186:19092,ceshi185:19092
dzm-agent.channels.dzm-channel-detail.kafka.topic = flume_dzm_detail_channel
dzm-agent.channels.dzm-channel-detail.kafka.consumer.group.id = flume_dzm_detail_channel
# detail-sink
dzm-agent.sinks.dzm-sink-detail.type = asynchbase
dzm-agent.sinks.dzm-sink-detail.table = t_invoice_detail_ticket
dzm-agent.sinks.dzm-sink-detail.columnFamily = i
dzm-agent.sinks.dzm-sink-detail.serializer = com.bwjf.flume.invoice.dzm.sink.InvoiceDetailHbaseSerializer
# assemble
dzm-agent.sources.dzm-source.channels = dzm-channel dzm-channel-detail
dzm-agent.sinks.dzm-sink.channel = dzm-channel
dzm-agent.sinks.dzm-sink-detail.channel = dzm-channel-detail
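The two serializer classes referenced above (InvoiceHbaseSerializer and InvoiceDetailHbaseSerializer) are custom code not shown in this post. As a rough illustration of what implementing Flume's AsyncHbaseEventSerializer interface involves, here is a minimal sketch in Java; the package, class name, row-key scheme, and column qualifier are assumptions for illustration, not the author's actual implementation.

package com.example.flume.sink;

import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.AsyncHbaseEventSerializer;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.PutRequest;

public class SketchInvoiceHbaseSerializer implements AsyncHbaseEventSerializer {

    private byte[] table;
    private byte[] columnFamily;
    private Event event;

    @Override
    public void initialize(byte[] table, byte[] cf) {
        // the sink passes in the table and columnFamily from the agent config
        this.table = table;
        this.columnFamily = cf;
    }

    @Override
    public void setEvent(Event event) {
        this.event = event;
    }

    @Override
    public List<PutRequest> getActions() {
        List<PutRequest> actions = new ArrayList<PutRequest>();
        if (event != null) {
            // illustrative row key; a real serializer would derive a
            // business key (e.g. the invoice number) from the event body
            byte[] rowKey = String.valueOf(System.currentTimeMillis()).getBytes();
            actions.add(new PutRequest(table, rowKey, columnFamily,
                    "payload".getBytes(), event.getBody()));
        }
        return actions;
    }

    @Override
    public List<AtomicIncrementRequest> getIncrements() {
        // no counter columns are maintained in this sketch
        return new ArrayList<AtomicIncrementRequest>();
    }

    @Override
    public void cleanUp() {
        // nothing to release
    }

    @Override
    public void configure(Context context) {
        // serializer-specific properties from the agent config would be read here
    }

    @Override
    public void configure(ComponentConfiguration conf) {
    }
}

The compiled class only needs to be on the agent's classpath (for example under plugins.d) for the serializer property above to resolve.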
5 Sink groups
The sinks in a sink group are never active simultaneously; at any given moment only one of them is used to send data, so a sink group should not be used as a way to drain a channel faster.
# read from kafka and write to hbase
dzm-agent.sources = dzm-source
dzm-agent.channels = dzm-channel dzm-channel-detail
dzm-agent.sinks = dzm-sink1 dzm-sink2 dzm-sink3 dzm-sink-detail1 dzm-sink-detail2 dzm-sink-detail3
dzm-agent.sinkGroups = dzm-sg dzm-detail-sg
# source
dzm-agent.sources.dzm-source.type = avro
dzm-agent.sources.dzm-source.bind = 0.0.0.0
dzm-agent.sources.dzm-source.port = 44443
#dzm-agent.sources.dzm-source.selector.type = replicating
# channel
dzm-agent.channels.dzm-channel.type = org.apache.flume.channel.kafka.KafkaChannel
dzm-agent.channels.dzm-channel.kafka.bootstrap.servers = ceshi185:19092,ceshi186:19092,ceshi185:19092
dzm-agent.channels.dzm-channel.kafka.topic = flume_dzm_channel
dzm-agent.channels.dzm-channel.kafka.consumer.group.id = flume_dzm_channel
# sink group
dzm-agent.sinkGroups.dzm-sg.sinks = dzm-sink1 dzm-sink2 dzm-sink3
dzm-agent.sinkGroups.dzm-sg.processor.type = load_balance
dzm-agent.sinkGroups.dzm-sg.processor.backoff = true
# sink
dzm-agent.sinks.dzm-sink1.type = asynchbase
dzm-agent.sinks.dzm-sink1.table = t_invoice_ticket
dzm-agent.sinks.dzm-sink1.columnFamily = i
dzm-agent.sinks.dzm-sink1.serializer = com.bwjf.flume.invoice.dzm.sink.InvoiceHbaseSerializer
dzm-agent.sinks.dzm-sink2.type = asynchbase
dzm-agent.sinks.dzm-sink2.table = t_invoice_ticket
dzm-agent.sinks.dzm-sink2.columnFamily = i
dzm-agent.sinks.dzm-sink2.serializer = com.bwjf.flume.invoice.dzm.sink.InvoiceHbaseSerializer
dzm-agent.sinks.dzm-sink3.type = asynchbase
dzm-agent.sinks.dzm-sink3.table = t_invoice_ticket
dzm-agent.sinks.dzm-sink3.columnFamily = i
dzm-agent.sinks.dzm-sink3.serializer = com.bwjf.flume.invoice.dzm.sink.InvoiceHbaseSerializer
# detail-channel
dzm-agent.channels.dzm-channel-detail.type = org.apache.flume.channel.kafka.KafkaChannel
dzm-agent.channels.dzm-channel-detail.kafka.bootstrap.servers = ceshi185:19092,ceshi186:19092,ceshi185:19092
dzm-agent.channels.dzm-channel-detail.kafka.topic = flume_dzm_detail_channel
dzm-agent.channels.dzm-channel-detail.kafka.consumer.group.id = flume_dzm_detail_channel
# detail-sink group
dzm-agent.sinkGroups.dzm-detail-sg.sinks = dzm-sink-detail1 dzm-sink-detail2 dzm-sink-detail3
dzm-agent.sinkGroups.dzm-detail-sg.processor.type = load_balance
dzm-agent.sinkGroups.dzm-detail-sg.processor.backoff = true
# detail-sink
dzm-agent.sinks.dzm-sink-detail1.type = asynchbase
dzm-agent.sinks.dzm-sink-detail1.table = t_invoice_detail_ticket
dzm-agent.sinks.dzm-sink-detail1.columnFamily = i
dzm-agent.sinks.dzm-sink-detail1.serializer = com.bwjf.flume.invoice.dzm.sink.InvoiceDetailHbaseSerializer
dzm-agent.sinks.dzm-sink-detail2.type = asynchbase
dzm-agent.sinks.dzm-sink-detail2.table = t_invoice_detail_ticket
dzm-agent.sinks.dzm-sink-detail2.columnFamily = i
dzm-agent.sinks.dzm-sink-detail2.serializer = com.bwjf.flume.invoice.dzm.sink.InvoiceDetailHbaseSerializer
dzm-agent.sinks.dzm-sink-detail3.type = asynchbase
dzm-agent.sinks.dzm-sink-detail3.table = t_invoice_detail_ticket
dzm-agent.sinks.dzm-sink-detail3.columnFamily = i
dzm-agent.sinks.dzm-sink-detail3.serializer = com.bwjf.flume.invoice.dzm.sink.InvoiceDetailHbaseSerializer
# assemble
dzm-agent.sources.dzm-source.channels = dzm-channel dzm-channel-detail
dzm-agent.sinks.dzm-sink1.channel = dzm-channel
dzm-agent.sinks.dzm-sink2.channel = dzm-channel
dzm-agent.sinks.dzm-sink3.channel = dzm-channel
dzm-agent.sinks.dzm-sink-detail1.channel = dzm-channel-detail
dzm-agent.sinks.dzm-sink-detail2.channel = dzm-channel-detail
dzm-agent.sinks.dzm-sink-detail3.channel = dzm-channel-detail
References:
- 《Flume构建高可用、可扩展的海量日志采集系统》 (Building a highly available, scalable system for massive log collection with Flume), by Hari Shreedharan; translated by Ma Yanhui and Shi Dongjie.
- Flume high-concurrency optimization — (3) haproxy
- Flume high-concurrency optimization — (9) managing configuration files with ZooKeeper
This article has walked through several strategies for optimizing Flume agents under high concurrency, including load balancing with haproxy, raising data-handling efficiency with the Kafka channel, and allocating resources with sink groups, all with the goal of building a stable and efficient data collection system.