When we want to learn a new framework, the official documentation is required reading: it is authoritative and, for the most part, free of errors.
Earlier articles already introduced some Flume usage; this one walks through the official documentation and some practical applications to make the material easier to study.
Flume Overview
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
In the canonical picture, a WebServer produces the data. Flume pulls that data in through a Source, buffers it in a Channel, and writes it out through a Sink, for example to HDFS. The core of the flow is the event moving through the Agent.
Flume is built for log collection, and it does so in a distributed, highly available way.
Flume aims to provide:
- Reliability: data is delivered safely; if one node goes down, another node can take its place immediately
- Scalability: capacity can be scaled out by adding Agents on more hardware
- Manageability: compared with similar products, it is better suited to distributed environments
In practice, moving log data often involves several Agents working together in a chain. The two-hop (multi-agent) flow looks like this:
The first Agent sends its output over RPC to serve as the data source of the second Agent. In this multi-agent mode, the previous agent's sink and the current hop's source both need to be of type avro, with the sink pointing at the hostname (or IP address) and port of the source.
Another common case is several WebServers producing logs at the same time, all of which need to land in the same HDFS. This calls for consolidation across tiers of agents:
This is achieved in Flume by configuring a number of first-tier agents with an avro sink (commonly used to ship data across nodes), all pointing to the avro source of a single agent (thrift sources/sinks/clients could also be used here). The source on the second-tier agent consolidates the received events into a single channel, which its sink drains to the final destination.
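As a rough sketch of the consolidation setup (the hostnames, ports, paths and agent names below are placeholders I picked for illustration, not values from the article), each first-tier agent tails its local log and forwards events through an avro sink, while a single second-tier agent receives them on one avro source and consolidates them into one channel:
# first-tier agent (one per WebServer): tail the local log and forward over avro
tier1.sources = web-src
tier1.channels = mem-ch
tier1.sinks = avro-fwd
tier1.sources.web-src.type = exec
tier1.sources.web-src.command = tail -F /var/log/access.log
tier1.sources.web-src.channels = mem-ch
tier1.channels.mem-ch.type = memory
tier1.sinks.avro-fwd.type = avro
tier1.sinks.avro-fwd.hostname = collector-host
tier1.sinks.avro-fwd.port = 4545
tier1.sinks.avro-fwd.channel = mem-ch
# second-tier collector: one avro source consolidating events from all tier-1 agents
tier2.sources = avro-in
tier2.channels = mem-ch
tier2.sinks = hdfs-out
tier2.sources.avro-in.type = avro
tier2.sources.avro-in.bind = 0.0.0.0
tier2.sources.avro-in.port = 4545
tier2.sources.avro-in.channels = mem-ch
tier2.channels.mem-ch.type = memory
tier2.sinks.hdfs-out.type = hdfs
tier2.sinks.hdfs-out.hdfs.path = hdfs://namenode:9000/flume/weblogs
tier2.sinks.hdfs-out.channel = mem-ch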
Flume also supports fanning a flow out (multiplexing) so that events can be stored in different places:
The example shows the source of agent "foo" fanning the flow out to three different channels. The fan-out can be replicating or multiplexing. In the replicating case every event is sent to all three channels; in the multiplexing case an event is delivered only to the subset of channels whose preconfigured values match the event's attributes.
For example, if an event header named "txnType" is set to "customer", the event should go to channel1 and channel3; if it is "vendor" it should go to channel2; otherwise it goes to channel3. The mapping is set in the agent's configuration file.
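A minimal sketch of such a multiplexing selector in the configuration file (the agent name a1, source name r1 and channel names are placeholders):
# route events to channels based on the value of the txnType header
a1.sources.r1.channels = channel1 channel2 channel3
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = txnType
a1.sources.r1.selector.mapping.customer = channel1 channel3
a1.sources.r1.selector.mapping.vendor = channel2
a1.sources.r1.selector.default = channel3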
Flume Use Cases
Case 1: Collect data from a specified network port and print it to the console
The key to using Flume is its configuration file, so let's look at one first:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Starting Flume hinges on the configuration file. Launch the agent like this:
$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/flume-telnet.conf -Dflume.root.logger=INFO,console
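To check that the agent is working, you can send a few lines to the port from another terminal (assuming telnet or nc is available); each line shows up as an event in the agent's console log, and the netcat source replies with OK:
$ telnet localhost 44444
hello flume
OK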
Case 2: Monitor a file and print newly appended data to the console in real time
Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results where as date will probably not - the former two commands produce streams of data where as the latter produces a single event and exits.
In short, as long as the command behind the Exec source keeps producing data, collection happens in real time; if the process exits, the source exits with it and produces no further data. As always, the heart of the matter is the configuration file. It looks like this:
The Agent is composed of exec source + memory channel + logger sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/modules/apache-hive-1.2.2-bin/hive.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
a1.sinks.k1.type = logger
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
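To try it out, start the agent with this configuration (the file name job/flume-exec.conf below is only an assumed name for the config above) and append a line to the tailed file from another terminal; the line should appear as an event in the console:
$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/flume-exec.conf -Dflume.root.logger=INFO,console
# in another terminal, append to the monitored file
$ echo "hello exec source" >> /opt/modules/apache-hive-1.2.2-bin/hive.log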
Real data pipelines are either offline or real-time. For offline processing we usually use the HDFS Sink; for real-time processing we pair Flume with Kafka, i.e. use the Kafka Sink. The rest of the configuration stays much the same.
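As a hedged sketch of what swapping the sink looks like, use one of the two blocks below in place of the logger sink (the HDFS path, broker list and topic are placeholders, not values from the article):
# offline: HDFS sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop1:9000/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
# real-time: Kafka sink (Flume 1.7-style properties)
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = hadoop1:9092
a1.sinks.k1.kafka.topic = flume-events
a1.sinks.k1.channel = c1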
Case 3: Collect logs from server A to server B in real time
The Agent architecture is as follows:
Component selection:
exec source + memory channel + avro sink
avro source + memory channel + logger sink
Write the two configuration files; the two Agents must not share the same name.
exec-memory-avro.conf
# Name the components on this agent
avro.sources = exec-source
avro.sinks = avro-sink
avro.channels = memory-channel
# Describe/configure the source
avro.sources.exec-source.type = exec
avro.sources.exec-source.command = tail -F /home/1/modules/apache-flume-1.7.0-bin/dataSource/data.log
avro.sources.exec-source.shell = /bin/bash -c
# Describe the sink
avro.sinks.avro-sink.type = avro
avro.sinks.avro-sink.hostname = hadoop1
avro.sinks.avro-sink.port = 44444
# Use a channel which buffers events in memory
avro.channels.memory-channel.type = memory
avro.channels.memory-channel.capacity = 1000
avro.channels.memory-channel.transactionCapacity = 100
# Bind the source and sink to the channel
avro.sources.exec-source.channels = memory-channel
avro.sinks.avro-sink.channel = memory-channel
avro-memory-logger.conf
# Name the components on this agent
avro-memory-logger.sources = avro-source
avro-memory-logger.sinks = logger-sink
avro-memory-logger.channels = memory-channel
# Describe/configure the source
avro-memory-logger.sources.avro-source.type = avro
avro-memory-logger.sources.avro-source.bind = hadoop1
avro-memory-logger.sources.avro-source.port = 44444
# Describe the sink
avro-memory-logger.sinks.logger-sink.type = logger
# Use a channel which buffers events in memory
avro-memory-logger.channels.memory-channel.type = memory
avro-memory-logger.channels.memory-channel.capacity = 1000
avro-memory-logger.channels.memory-channel.transactionCapacity = 100
# Bind the source and sink to the channel
avro-memory-logger.sources.avro-source.channels = memory-channel
avro-memory-logger.sinks.logger-sink.channel = memory-channel
There are two Agents: start avro-memory-logger first, then exec-memory-avro.
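A sketch of the two start commands (I'm assuming the two configuration files above are saved under job/, matching the earlier example):
# start the downstream agent first
$ bin/flume-ng agent --conf conf/ --name avro-memory-logger --conf-file job/avro-memory-logger.conf -Dflume.root.logger=INFO,console
# then, in another terminal, start the upstream agent
$ bin/flume-ng agent --conf conf/ --name avro --conf-file job/exec-memory-avro.conf -Dflume.root.logger=INFO,console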
When data is sent, it does arrive on the avro-memory-logger side, but no event output shows up on the exec-memory-avro side. That is expected: its sink is avro, which forwards events to the next hop rather than printing them.
The log collection flow:
- Machine A monitors a file; when users visit the main site, their access logs are written to x.log
- The avro sink ships newly produced log lines to the hostname and port specified by the corresponding avro source
- The Agent behind the avro source then writes the logs out to the console (or, in production, on to Kafka)