Flume Data Collection in Detail

Article link: https://blog.youkuaiyun.com/qq_41072814/article/details/115604018

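Flume overview

Definition: Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. It was originally developed by Cloudera.

Characteristics: Flume is built on a streaming architecture and is simple and flexible.

Purpose: A typical use is to read data from a server's local disk in real time and write it to HDFS. Flume is not limited to reading from disk or writing to HDFS, though; it can also, for example, take data from a Python crawler platform and deliver it to HDFS or Kafka.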

Flume basic architecture

Flume basic architecture diagram

Figure 1. Flume basic architecture

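Agent

Definition: An Agent is a JVM process that moves data from its origin to its destination in the form of events.

Components: An Agent consists mainly of three parts: Source, Channel, and Sink.

Source

Definition: The Source is the component that receives data into the Flume Agent; it is the Agent's "entrance".

Purpose: Source components can handle log data of many types and formats.

Types: avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and others.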

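Sink

Definition: The Sink is the component that sends data on to the next consumer; it is the Agent's "exit".

Purpose: The Sink continuously polls the Channel for events, removes them in batches, and writes each batch to a storage or indexing system, or forwards it to another Flume Agent.

Sink destinations: hdfs, logger, avro, thrift, ipc, file, hbase, solr, and others.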

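Channel

Definition: The Channel is a buffer that sits between the Source and the Sink.

Purpose: The Channel transports and buffers events between the Source and the Sink. A KafkaChannel can even take the place of a Sink and deliver data to Kafka itself.

Types: MemoryChannel, FileChannel, KafkaChannel, and others.

MemoryChannel: an in-memory queue. Because it lives in memory it is fast but less reliable: if the process stops or the machine goes down, the events held in memory are lost.

FileChannel: writes every event to disk, so events are not lost when the process stops or the machine goes down.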

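Event

Definition: The Event is Flume's unit of data transfer.

Structure: An Event consists of a Header and a Body. The Header stores attributes of the event as key-value pairs. The Body holds the data itself as a byte array.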
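The header/body split described above can be pictured with a small plain-Java sketch. This is not Flume's own Event class; EventSketch and SketchEvent are hypothetical names used only for illustration, and the header key shown is just an example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class EventSketch {
    /** Hypothetical stand-in for an event: string headers plus a byte-array body. */
    static final class SketchEvent {
        final Map<String, String> headers = new HashMap<>();
        final byte[] body;

        SketchEvent(String text) {
            // The body is stored as raw bytes, not as a String
            this.body = text.getBytes(StandardCharsets.UTF_8);
        }

        String bodyAsText() {
            return new String(body, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        SketchEvent event = new SketchEvent("hello flume");
        // Headers carry metadata about the event, separate from the payload
        event.headers.put("timestamp", String.valueOf(System.currentTimeMillis()));
        System.out.println(event.headers.keySet() + " " + event.bodyAsText());
    }
}
```

Keeping metadata in the headers is what later allows interceptors and channel selectors to route an event without parsing its body.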

Flume versions

Versions 0.9.x and earlier are known as Flume OG (original generation).

Versions 1.x and later are known as Flume NG (next generation).

Installing and deploying Flume

Upload apache-flume-1.9.0-bin.tar.gz to the server.

Extract apache-flume-1.9.0-bin.tar.gz (the rest of this article assumes Flume ends up in /opt/module/flume-1.9.0):

tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /path/to/your/flume

Rename the extracted apache-flume-1.9.0-bin directory:

mv apache-flume-1.9.0-bin flume-1.9.0

Delete guava-11.0.2.jar from Flume's lib directory for compatibility with hadoop-3.1.3:

rm -rf /path/to/your/flume/lib/guava-11.0.2.jar

Configure the environment variables:

sudo vim /etc/profile.d/my_env.sh

#flume

export FLUME_HOME=/opt/module/flume-1.9.0

export PATH=$PATH:$FLUME_HOME/bin

source /etc/profile.d/my_env.sh

Common Flume commands

flume-ng

Command

agent

Options

--conf / -c

--name / -n

--conf-file / -f

and others

Example:

flume-ng agent --name AGENT_NAME --conf CONF_DIRECTORY --conf-file CONF_FILE (the configuration file is one you write yourself)

Agent configuration file

Sample

# example.conf: A single-node Flume configuration

# Name the agent a1
# Name the source r1
# Name the sink k1
# Name the channel c1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

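Agent configuration steps

1. Choose a name for the agent, for example a1.

2. Name the sources, channels, and sinks, for example r1, c1, and k1.

3. Specify the concrete type of each source, channel, and sink and configure it, for example a1.sources.r1.type=netcat.

4. Declare how the source, channel, and sink are connected to each other.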

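Flume getting-started example

Requirement: use Flume to listen on a port, collect the data arriving on that port, and print it to the console.

Analysis: see the figure.

Figure 2. Flume listening on a port and printing to the console

Approach and steps

Approach:

The source listens on the port.

The channel is a memory channel.

The sink writes to the console.

Steps:

1. Install netcat

sudo yum install -y netcat

2. Check whether port 44444 is already in use (44444 is only an example; use whichever port you want to test)

sudo netstat -tulp | grep 44444

3. Optionally, check that netcat works by listening on the port. Stop this listener before starting Flume, because the netcat source has to bind the same port.

nc -l 44444

4. Create netcatsource_loggersink.conf wherever you like (here a job directory is created under the Flume directory and the file is placed in it)

cd /opt/module/flume-1.9.0
mkdir job
cd job

Edit netcatsource_loggersink.conf:

vim netcatsource_loggersink.conf

Add the following content: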

a1.sources=r1
a1.channels=c1
a1.sinks=k1

a1.sources.r1.type=netcat
a1.sources.r1.bind=hadoop002
a1.sources.r1.port=44444

a1.channels.c1.type=memory
# Optional; if omitted, Flume's default capacity is 100
a1.channels.c1.capacity=1000
# Optional; 100 is the default
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=logger

a1.sources.r1.channels=c1
# A sink can be bound to only one channel
a1.sinks.k1.channel=c1

5. Run Flume

flume-ng agent --name a1 --conf /opt/module/flume-1.9.0/conf --conf-file /opt/module/flume-1.9.0/job/netcatsource_loggersink.conf -Dflume.root.logger=INFO,console

or, with the short options:

flume-ng agent -n a1 -c /opt/module/flume-1.9.0/conf -f /opt/module/flume-1.9.0/job/netcatsource_loggersink.conf -Dflume.root.logger=INFO,console

Example 2 (execsource_hdfssink)

1. Create the demo directory and a test log file

cd /opt/module/flume-1.9.0/
mkdir demo
cd demo
touch test01.log

2. Edit the Flume configuration file execsource_hdfssink.conf

cd /opt/module/flume-1.9.0/job/
vim execsource_hdfssink.conf

3. The configuration is as follows

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=exec
a1.sources.r1.command=tail -F /opt/module/flume-1.9.0/demo/test01.log

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=hdfs
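# Write to date-based HDFS directories using time escape sequences; this requires either a timestamp header on the event or hdfs.useLocalTimeStamp=true
a1.sinks.s1.hdfs.path=hdfs://hadoop001:8020/flume/%Y%m%d/%H
# Use the local time instead of a timestamp from the event header
a1.sinks.s1.hdfs.useLocalTimeStamp=true
# File name prefix
a1.sinks.s1.hdfs.filePrefix=log-
# Round the timestamp down, so directories roll over by time
a1.sinks.s1.hdfs.round=true
# How many time units to round down to
a1.sinks.s1.hdfs.roundValue=1
# Time unit used for rounding
a1.sinks.s1.hdfs.roundUnit=hour

# Number of events to accumulate before flushing to HDFS
a1.sinks.s1.hdfs.batchSize=100
# File type; DataStream writes uncompressed data, CompressedStream enables compression
a1.sinks.s1.hdfs.fileType=DataStream

# Roll over to a new file after this many seconds
a1.sinks.s1.hdfs.rollInterval=60
# Roll over when the file reaches this size in bytes (just under 128 MB)
a1.sinks.s1.hdfs.rollSize=134217700
# 0 means file rolling does not depend on the number of events
a1.sinks.s1.hdfs.rollCount=0

# Bind the source and the sink to the channel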
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
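
To make the effect of hdfs.path together with round, roundValue=1, and roundUnit=hour more concrete, here is a plain-Java sketch that resolves the %Y%m%d/%H pattern for a given timestamp after rounding it down to the hour. It does not call any Flume code; the class HdfsPathSketch and the method resolve are hypothetical, and the time zone is passed in explicitly only to keep the example deterministic.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class HdfsPathSketch {
    /** Builds prefix/yyyyMMdd/HH from a timestamp rounded down to the hour. */
    static String resolve(String prefix, long epochMillis, ZoneId zone) {
        ZonedDateTime rounded = Instant.ofEpochMilli(epochMillis)
                .atZone(zone)
                .truncatedTo(ChronoUnit.HOURS); // same idea as round=true, roundUnit=hour
        return prefix + "/" + rounded.format(DateTimeFormatter.ofPattern("yyyyMMdd/HH"));
    }

    public static void main(String[] args) {
        // Uses the local clock and zone, as useLocalTimeStamp=true would
        System.out.println(resolve("/flume", System.currentTimeMillis(), ZoneId.systemDefault()));
    }
}
```

All events whose timestamps fall within the same hour resolve to the same directory; within that directory, rollInterval, rollSize, and rollCount decide when a new file is started.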

Example 3 (monitoring a directory for new files in real time: spooldirsource_hdfssink)

1. Go to the flume-1.9.0 directory and create an upload directory; inside it, create test01.log. A few test records can be written to test01.log with echo.

cd /opt/module/flume-1.9.0/
mkdir upload && cd upload && touch test01.log
echo 11 >> test01.log

2. Go to the job directory and edit spooldirsource_hdfssink.conf

cd /opt/module/flume-1.9.0/job && vim spooldirsource_hdfssink.conf

3. The configuration is as follows

a1.sources=r1
a1.channels=c1
a1.sinks=s1

# Use a spooldir source (watches a directory and automatically collects the files placed in it)
a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/opt/module/flume-1.9.0/upload
a1.sources.r1.fileSuffix=.COMPLETED

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=hdfs
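# Write to date-based HDFS directories using time escape sequences; this requires either a timestamp header on the event or hdfs.useLocalTimeStamp=true
a1.sinks.s1.hdfs.path=hdfs://hadoop001:8020/flume/%Y%m%d/%H
# Use the local time instead of a timestamp from the event header
a1.sinks.s1.hdfs.useLocalTimeStamp=true
# File name prefix
a1.sinks.s1.hdfs.filePrefix=log-
# Round the timestamp down, so directories roll over by time
a1.sinks.s1.hdfs.round=true
# How many time units to round down to
a1.sinks.s1.hdfs.roundValue=1
# Time unit used for rounding
a1.sinks.s1.hdfs.roundUnit=hour

# Number of events to accumulate before flushing to HDFS
a1.sinks.s1.hdfs.batchSize=100
# File type; DataStream writes uncompressed data, CompressedStream enables compression
a1.sinks.s1.hdfs.fileType=DataStream

# Roll over to a new file after this many seconds
a1.sinks.s1.hdfs.rollInterval=60
# Roll over when the file reaches this size in bytes (just under 128 MB)
a1.sinks.s1.hdfs.rollSize=134217700
# 0 means file rolling does not depend on the number of events
a1.sinks.s1.hdfs.rollCount=0

# Bind the source and the sink to the channel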
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

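Example 4 (reading files in a directory to HDFS in real time: taildirsource_hdfssink)

1. Preparation (create the taildirposition directory and go into it)

cd /opt/module/flume-1.9.0
mkdir taildirposition && cd taildirposition

2. Edit the configuration file

cd /opt/module/flume-1.9.0/job
vim taildirsource_hdfssink.conf

3. The configuration is as follows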

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=TAILDIR
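# File in which Flume records how far it has read in each tailed file (must be a file path, not a directory)
a1.sources.r1.positionFile=/opt/module/flume-1.9.0/taildirposition/taildir_position.json
# Define the file groups
a1.sources.r1.filegroups=f1
a1.sources.r1.filegroups.f1=/opt/module/flume-1.9.0/demo/test01.log

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

# A logger sink is used here for a quick check; to write to HDFS, use the HDFS sink settings from Example 2
a1.sinks.s1.type=logger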

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Flume transactions

Flume transactions are illustrated in the figure below. A put transaction covers the hop from the source to the channel, and a take transaction covers the hop from the channel to the sink.

Figure 3. Flume transactions
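
The take side of this mechanism can be sketched in plain Java: events are removed from the channel as a batch, and if delivery fails they are put back in their original order so that they can be retried. This is only an illustration of the commit/rollback idea under simplifying assumptions (a single thread, an in-memory deque); TransactionSketch and takeBatch are hypothetical names, not Flume's transaction API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

public class TransactionSketch {
    /**
     * Takes up to batchSize events and hands them to deliver.
     * Returns true on commit; on failure the events are restored to the channel.
     */
    static boolean takeBatch(Deque<String> channel, int batchSize, Predicate<List<String>> deliver) {
        List<String> taken = new ArrayList<>();
        while (taken.size() < batchSize && !channel.isEmpty()) {
            taken.add(channel.pollFirst());
        }
        if (deliver.test(taken)) {
            return true; // commit: the events stay removed
        }
        // rollback: push the events back to the head, preserving their order
        for (int i = taken.size() - 1; i >= 0; i--) {
            channel.addFirst(taken.get(i));
        }
        return false;
    }

    public static void main(String[] args) {
        Deque<String> channel = new ArrayDeque<>(Arrays.asList("e1", "e2", "e3"));
        System.out.println(takeBatch(channel, 2, batch -> false) + " " + channel); // false [e1, e2, e3]
        System.out.println(takeBatch(channel, 2, batch -> true) + " " + channel);  // true [e3]
    }
}
```

Because a rolled-back batch is delivered again, a destination that did receive the first attempt sees the same events twice; this is the source of the duplicates discussed in the interview questions at the end of the article.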

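How a Flume Agent works internally

1. The Source receives data, wraps it in Events, and hands them to the ChannelProcessor.

2. The ChannelProcessor passes the Events through the interceptor chain (one or more interceptors).

3. Each Interceptor applies its logic to the Events (that is, it processes the data in the Event), and the result is returned to the ChannelProcessor.

4. The ChannelProcessor then hands the intercepted Events to the ChannelSelector.

5. The ChannelSelector matters when there are several Channels. There are two kinds: the Replicating Channel Selector and the Multiplexing Channel Selector. The Replicating Channel Selector sends each Event to all Channels; the Multiplexing Channel Selector sends each Event to specific Channels. The ChannelSelector returns its result to the ChannelProcessor.

6. Based on the ChannelSelector's result, the ChannelProcessor writes the Events to the corresponding Channels.

7. The SinkProcessor decides which Sink takes Events from the Channel; this is where load balancing and failover are implemented.

8. The Events finally reach the Sink.

Figure 4. Agent internals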
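
The difference between the two channel selectors in step 5 can be shown with a small plain-Java sketch. It models channels by name only and does not use Flume's classes; SelectorSketch, replicating, and multiplexing are hypothetical names, and the defaultChannels parameter is an assumption about what to do when no mapping matches.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SelectorSketch {
    /** Replicating: every event goes to every channel. */
    static List<String> replicating(List<String> allChannels) {
        return allChannels;
    }

    /** Multiplexing: the value of one header picks the channels for this event. */
    static List<String> multiplexing(Map<String, String> headers, String headerName,
                                     Map<String, List<String>> mapping,
                                     List<String> defaultChannels) {
        String value = headers.get(headerName);
        if (value == null || !mapping.containsKey(value)) {
            return defaultChannels; // header missing or value not mapped
        }
        return mapping.get(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("CZ", Collections.singletonList("c1"));
        mapping.put("US", Collections.singletonList("c2"));

        Map<String, String> headers = new HashMap<>();
        headers.put("state", "CZ");

        System.out.println(replicating(Arrays.asList("c1", "c2")));                                     // [c1, c2]
        System.out.println(multiplexing(headers, "state", mapping, Collections.<String>emptyList())); // [c1]
    }
}
```

The mapping mirrors the selector.mapping.CZ and selector.mapping.US entries used in the multiplexing example later in this article.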

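Flume topologies

Chaining

Figure 5. Flume agents in a chain

In a chain, keep the number of Flume agents small. A long chain reduces throughput, and a failure in any one agent affects the whole pipeline.

The configuration follows.

Configuration on hadoop001 (Agent1):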

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=netcat
a1.sources.r1.bind=hadoop001
a1.sources.r1.port=44444

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=avro
a1.sinks.s1.hostname=hadoop002
a1.sinks.s1.port=20000

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Configuration on hadoop002 (Agent2):

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=avro
a1.sources.r1.bind=hadoop002
a1.sources.r1.port=20000

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=logger

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

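After configuring both, start hadoop002 (Agent2) first and then hadoop001 (Agent1), so that the avro source is already listening when the avro sink connects.

Replication and multiplexing

Figure 6. Replication and multiplexing

Whether events are replicated or multiplexed depends on whether the channel selector is set to replicating or multiplexing.

Replication example

First, configure execsource_avrosink.conf on hadoop001 (Agent1)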

a1.sources=r1
a1.channels=c1 c2
a1.sinks=s1 s2

a1.sources.r1.type=exec
a1.sources.r1.command=tail -F /opt/module/flume-1.9.0/demo/test01.log
# Channel selector; replicating is the default, so this line can be omitted
a1.sources.r1.selector.type=replicating

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.channels.c2.type=memory
a1.channels.c2.capacity=1000

a1.sinks.s1.type=avro
a1.sinks.s1.hostname=hadoop002
a1.sinks.s1.port=20000

a1.sinks.s2.type=avro
a1.sinks.s2.hostname=hadoop003
a1.sinks.s2.port=30000

a1.sources.r1.channels=c1 c2
a1.sinks.s1.channel=c1
a1.sinks.s2.channel=c2

Next, configure avrosource_filerollsink.conf on hadoop002 (Agent2)

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=avro
a1.sources.r1.bind=hadoop002
a1.sources.r1.port=20000

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=file_roll
a1.sinks.s1.sink.directory=/opt/module/flume-1.9.0/output
# How often to roll over to a new file, in seconds; the default is 30
a1.sinks.s1.sink.rollInterval=30

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Finally, configure avrosource_hdfssink.conf on hadoop003 (Agent3)

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=avro
a1.sources.r1.bind=hadoop003
a1.sources.r1.port=30000

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=hdfs
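# Write to date-based HDFS directories using time escape sequences; this requires either a timestamp header on the event or hdfs.useLocalTimeStamp=true
a1.sinks.s1.hdfs.path=hdfs://hadoop001:8020/flume/%Y%m%d/%H
# Use the local time instead of a timestamp from the event header
a1.sinks.s1.hdfs.useLocalTimeStamp=true
# File name prefix
a1.sinks.s1.hdfs.filePrefix=log-
# Round the timestamp down, so directories roll over by time
a1.sinks.s1.hdfs.round=true
# How many time units to round down to
a1.sinks.s1.hdfs.roundValue=1
# Time unit used for rounding
a1.sinks.s1.hdfs.roundUnit=hour

# Number of events to accumulate before flushing to HDFS
a1.sinks.s1.hdfs.batchSize=100
# File type; DataStream writes uncompressed data, CompressedStream enables compression
a1.sinks.s1.hdfs.fileType=DataStream

# Roll over to a new file after this many seconds
a1.sinks.s1.hdfs.rollInterval=60
# Roll over when the file reaches this size in bytes (just under 128 MB)
a1.sinks.s1.hdfs.rollSize=134217700
# 0 means file rolling does not depend on the number of events
a1.sinks.s1.hdfs.rollCount=0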

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Start hadoop003 (Agent3) and hadoop002 (Agent2) first, then start hadoop001 (Agent1).

Multiplexing example

First, configure hadoop001 (Agent1)

a1.sources=r1
a1.channels=c1 c2
a1.sinks=s1 s2

a1.sources.r1.type=exec
a1.sources.r1.command=tail -F /opt/module/flume-1.9.0/demo/test01.log

# Use the multiplexing channel selector: events are routed by the value of the state header
a1.sources.r1.selector.type=multiplexing
a1.sources.r1.selector.header=state
a1.sources.r1.selector.mapping.CZ=c1
a1.sources.r1.selector.mapping.US=c2
# Static interceptor: adds state=CZ to the headers of every Event, so with this setting all events are routed to c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = state
a1.sources.r1.interceptors.i1.value = CZ


a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.channels.c2.type=memory
a1.channels.c2.capacity=1000

a1.sinks.s1.type=avro
a1.sinks.s1.hostname=hadoop002
a1.sinks.s1.port=20000

a1.sinks.s2.type=avro
a1.sinks.s2.hostname=hadoop003
a1.sinks.s2.port=30000

a1.sources.r1.channels=c1 c2
a1.sinks.s1.channel=c1
a1.sinks.s2.channel=c2

Next, configure hadoop002 (Agent2)

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=avro
a1.sources.r1.bind=hadoop002
a1.sources.r1.port=20000

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=logger

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Finally, configure hadoop003 (Agent3)

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=avro
a1.sources.r1.bind=hadoop003
a1.sources.r1.port=30000

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=logger

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

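Load balancing and failover

Figure 7. Load balancing and failover

Whether the agent does load balancing or failover depends on whether the sink processor type is load_balance or failover.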
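
The two policies can be sketched in plain Java. The failover pick chooses the available sink with the highest priority value, and the load-balancing pick simply rotates through the sinks (round robin is used here for illustration; the example later in this section configures the random selector instead). SinkPickSketch and its methods are hypothetical names and do not reflect Flume's internal implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SinkPickSketch {
    /** Failover: the live sink with the largest priority value wins; null if none is live. */
    static String pickFailover(Map<String, Integer> priorities, Set<String> failed) {
        String best = null;
        for (Map.Entry<String, Integer> entry : priorities.entrySet()) {
            if (failed.contains(entry.getKey())) {
                continue; // skip sinks currently marked as failed
            }
            if (best == null || entry.getValue() > priorities.get(best)) {
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Load balancing (round robin): the attempt counter selects the next sink in turn. */
    static String pickRoundRobin(List<String> sinks, int attempt) {
        return sinks.get(attempt % sinks.size());
    }

    public static void main(String[] args) {
        Map<String, Integer> priorities = new LinkedHashMap<>();
        priorities.put("s1", 5);
        priorities.put("s2", 10);

        System.out.println(pickFailover(priorities, new HashSet<String>()));              // s2
        System.out.println(pickFailover(priorities, new HashSet<>(Arrays.asList("s2")))); // s1
        System.out.println(pickRoundRobin(Arrays.asList("s1", "s2"), 3));                 // s2
    }
}
```

With the priorities used in the failover example (s1 = 5, s2 = 10), s2 handles the traffic while it is healthy, and s1 takes over only when s2 fails.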

Failover example

Agent1 (hadoop001)

a1.sources=r1
a1.channels=c1
a1.sinks=s1 s2

a1.sources.r1.type=netcat
a1.sources.r1.bind=hadoop001
a1.sources.r1.port=22222

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = s1 s2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.s1 = 5
a1.sinkgroups.g1.processor.priority.s2 = 10

a1.sinks.s1.type=avro
a1.sinks.s1.hostname=hadoop002
a1.sinks.s1.port=20000

a1.sinks.s2.type=avro
a1.sinks.s2.hostname=hadoop003
a1.sinks.s2.port=30000

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
a1.sinks.s2.channel=c1

Agent2 (hadoop002)

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=avro
a1.sources.r1.bind=hadoop002
a1.sources.r1.port=20000

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=logger

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Agent3 (hadoop003)

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=avro
a1.sources.r1.bind=hadoop003
a1.sources.r1.port=30000

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=logger

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Load balancing example

Agent1 (hadoop001)

a1.sources=r1
a1.channels=c1
a1.sinks=s1 s2

a1.sources.r1.type=netcat
a1.sources.r1.bind=hadoop001
a1.sources.r1.port=22222

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

# Name the sink group
a1.sinkgroups = g1
# Add sinks to group g1
a1.sinkgroups.g1.sinks = s1 s2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = random

a1.sinks.s1.type=avro
a1.sinks.s1.hostname=hadoop002
a1.sinks.s1.port=20000

a1.sinks.s2.type=avro
a1.sinks.s2.hostname=hadoop003
a1.sinks.s2.port=30000

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
a1.sinks.s2.channel=c1

Agent2 (hadoop002)

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=avro
a1.sources.r1.bind=hadoop002
a1.sources.r1.port=20000

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=logger

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Agent3 (hadoop003)

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=avro
a1.sources.r1.bind=hadoop003
a1.sources.r1.port=30000

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.sinks.s1.type=logger

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Aggregation

Figure 8. Aggregation

Interceptors

Adding an interceptor to the getting-started example

The configuration is as follows:

a1.sources=r1
a1.channels=c1
a1.sinks=k1

a1.sources.r1.type=netcat
a1.sources.r1.bind=hadoop002
a1.sources.r1.port=44444

# Configure the timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

a1.channels.c1.type=memory
# Optional; if omitted, Flume's default capacity is 100
a1.channels.c1.capacity=1000
# Optional; 100 is the default
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=logger

a1.sources.r1.channels=c1
# A sink can be bound to only one channel
a1.sinks.k1.channel=c1

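After configuring, start the Agent with the flume-ng command.

flume-ng agent -n a1 -c /opt/module/flume-1.9.0/conf/ -f /opt/module/flume-1.9.0/job/netcatsource_interceptor_loggersink.conf -Dflume.root.logger=INFO,console

Then run nc to start netcat and send data to hadoop002, the host the netcat source is bound to

nc hadoop002 44444

In the Agent's console output, the headers, which were previously empty, now contain a key-value pair whose key is timestamp and whose value is the time the event was received.

Figure 9. timestamp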

Custom interceptor

a1.sources=r1
a1.channels=c1 c2
a1.sinks=s1 s2

a1.sources.r1.type=netcat
a1.sources.r1.bind=hadoop001
a1.sources.r1.port=44444


a1.sources.r1.selector.type=multiplexing
a1.sources.r1.selector.header=type
a1.sources.r1.selector.mapping.letter=c1
a1.sources.r1.selector.mapping.number=c2

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.ithome.demo.MyInterceptor$MyBuilder

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000

a1.channels.c2.type=memory
a1.channels.c2.capacity=1000

a1.sinks.s1.type=avro
a1.sinks.s1.hostname=hadoop002
a1.sinks.s1.port=20000

a1.sinks.s2.type=avro
a1.sinks.s2.hostname=hadoop003
a1.sinks.s2.port=30000

a1.sources.r1.channels=c1 c2
a1.sinks.s1.channel=c1
a1.sinks.s2.channel=c2

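Code

1. Create the MyInterceptor class

2. Implement the Interceptor interface (implements Interceptor)

3. Override initialize(), Event intercept(Event event), List<Event> intercept(List<Event> events), and close()

4. The main logic goes in Event intercept(Event event)

4.1 Get the content of the body

4.2 Determine whether the body content is a number or letters

The code is as follows: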

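package com.ithome.demo;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;

/**
 * Custom interceptor.
 * Adds a key-value pair to the headers based on the content of the body:
 * type=letter if the body starts with a letter,
 * type=number if the body starts with a digit.
 */
public class MyInterceptor implements Interceptor {
    @Override
    public void initialize() {
        // Nothing to set up
    }

    /**
     * The ChannelProcessor calls this method and passes in each event.
     *
     * @param event the event to inspect
     * @return the same event, with the type header set when the first byte is a letter or a digit
     */
    @Override
    public Event intercept(Event event) {
        // 1. Get the content of the body
        byte[] body = event.getBody();

        // An empty body has no first byte to inspect; pass it through unchanged
        if (body == null || body.length == 0) {
            return event;
        }

        // 2. Decide whether the body starts with a letter or a digit
        if ((body[0] >= 'A' && body[0] <= 'Z') || (body[0] >= 'a' && body[0] <= 'z')) {
            event.getHeaders().put("type", "letter");
        } else if (body[0] >= '0' && body[0] <= '9') {
            // Compare against the characters '0'..'9', not the numeric values 0..9
            event.getHeaders().put("type", "number");
        }

        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        // 1. Iterate over all events in the list
        for (Event event : events) {
            // 2. The main logic lives in intercept(Event), so each event is simply passed to it
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // Nothing to release
    }

    /** Flume creates the interceptor through this builder, which is why the configuration names MyInterceptor$MyBuilder. */
    public static class MyBuilder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new MyInterceptor();
        }

        @Override
        public void configure(Context context) {
            // No configurable properties
        }
    }
}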

5. Package the code with IDEA

6. Upload the packaged jar to the /opt/module/flume-1.9.0/lib directory
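
Before packaging, the classification rule can be checked on its own, without a Flume dependency. The following plain-Java sketch applies the rule from step 4.2 to a byte array: the first byte decides between letter and number, and anything else (including an empty body) is left unclassified. BodyTypeSketch and classify are hypothetical names introduced for this check; the interceptor itself puts the result into the type header.

```java
import java.nio.charset.StandardCharsets;

public class BodyTypeSketch {
    /** Returns "letter", "number", or null when the first byte is neither. */
    static String classify(byte[] body) {
        if (body == null || body.length == 0) {
            return null; // nothing to inspect
        }
        byte first = body[0];
        if ((first >= 'A' && first <= 'Z') || (first >= 'a' && first <= 'z')) {
            return "letter";
        }
        if (first >= '0' && first <= '9') {
            // Digits are compared as characters: '0' is byte 48, not 0
            return "number";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(classify("hello".getBytes(StandardCharsets.UTF_8))); // letter
        System.out.println(classify("42".getBytes(StandardCharsets.UTF_8)));    // number
        System.out.println(classify("#tag".getBytes(StandardCharsets.UTF_8)));  // null
    }
}
```

Events that end up without a type header match neither selector.mapping entry, so with the multiplexing configuration above they are not routed to c1 or c2.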

Custom source

package com.ithome.demo;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;


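/**
 * Custom source.
 * Requirement: use Flume to receive data, add a prefix to every record, and print it to the console.
 * The prefix can be set in the Flume configuration file.
 */
public class MySource extends AbstractSource implements Configurable, PollableSource {

    private String prefix;

    /**
     * Core method. Wraps data in Events and sends them to the channel. Flume calls this method in a loop.
     *
     * @return READY when the batch was delivered, BACKOFF when delivery failed
     * @throws EventDeliveryException
     */
    @Override
    public Status process() throws EventDeliveryException {
        /**
         * The commented-out code below adds Events one at a time
         */
        //1. Create the Event
//        SimpleEvent simpleEvent = new SimpleEvent();
        //2. Set the Event's data
//        simpleEvent.setBody( (prefix+"").getBytes());
        //3. Get the ChannelProcessor
//        ChannelProcessor channelProcessor = getChannelProcessor();
        //4. Hand the Event to the ChannelProcessor, which puts it into the channel
//        channelProcessor.processEvent(simpleEvent);
        /**
         * The code below adds several Events in one batch
         */
        Status status = Status.READY;
        List<Event> eventList = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            SimpleEvent simpleEvent = new SimpleEvent();
            simpleEvent.setBody((prefix + "hello" + i).getBytes(StandardCharsets.UTF_8));
            eventList.add(simpleEvent);
        }
        try {
            ChannelProcessor channelProcessor = getChannelProcessor();
            channelProcessor.processEventBatch(eventList);
        } catch (Exception e) {
            // Delivery failed; ask Flume to back off before calling process() again
            status = Status.BACKOFF;
        }

        return status;
    }

    /**
     * Receives the context and reads settings from the configuration file.
     *
     * @param context the source's configuration
     */
    @Override
    public void configure(Context context) {
        // The property name "prefix" is chosen for this example; an empty prefix is used when it is not set
        prefix = context.getString("prefix", "");
    }

    /**
     * Sleep-time increment in milliseconds, applied when the source backs off because it has nothing to deliver.
     *
     * @return the increment in milliseconds
     */
    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    /**
     * Upper bound in milliseconds on the back-off sleep time.
     *
     * @return the maximum sleep interval in milliseconds
     */
    @Override
    public long getMaxBackOffSleepInterval() {
        return 2000L;
    }
}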

Matching Flume configuration (customsource.conf)

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=com.ithome.demo.MySource

a1.channels.c1.type=memory

a1.sinks.s1.type=logger

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Ganglia

Downloading and installing Ganglia

# The default CentOS yum repositories do not include Ganglia, so install the EPEL repository first
wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
rpm -ivh epel-release-latest-7.noarch.rpm

# Install gmetad, gmond, and the web front end
yum -y install ganglia*

Configuring Ganglia

1. Edit ganglia.conf

sudo vim /etc/httpd/conf.d/ganglia.conf

The configuration is as follows:

<Location /ganglia>
  # Order deny,allow
  # Allow from all
  # Deny from all
  # Allow from 127.0.0.1
  # Allow from ::1
  # Allow from .example.com
  Require all granted
</Location>

2. Edit gmond.conf

sudo vim /etc/ganglia/gmond.conf

The configuration is as follows:

cluster {
  name = "hadoop001"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
udp_send_channel {
  #bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  # mcast_join = 239.2.11.71
  host = 192.168.116.150
  port = 8649
  ttl = 1
}
udp_recv_channel {
  # mcast_join = 239.2.11.71
  port = 8649
  bind = 192.168.116.150
  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

3. Edit /etc/selinux/config

sudo vim /etc/selinux/config

The configuration is as follows:

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted

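Changes to /etc/selinux/config take effect only after a reboot. If you do not want to reboot the server, SELinux can be switched off temporarily:

sudo setenforce 0

4. Start Ganglia

sudo service httpd start && sudo service gmetad start && sudo service gmond start

5. Open the Ganglia web page

http://192.168.116.150/ganglia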

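Run Flume to test the monitoring: start an agent with the options -Dflume.monitoring.type=ganglia and -Dflume.monitoring.hosts=192.168.116.150:8649 so that it reports its metrics to gmond, then check the graphs on the Ganglia page.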

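Flume interview questions

1. How do you monitor Flume data transfer?

Answer: We use Ganglia, a third-party monitoring framework.

2. What do Flume's source, sink, and channel do? Which source types do you use?

Answer: The source receives data into the agent, the channel buffers events between the source and the sink, and the sink delivers events to the next destination. The source types our company uses are:

Monitoring back-end logs: exec

Monitoring a port on which the back end emits logs: netcat

3. Flume's channel selector

There are two channel selectors: replicating and multiplexing.

replicating sends every Event from the source to all channels.

multiplexing sends each Event from the source selectively to specific channels.

4. Flume tuning

4.1 Source

Increase the number of sources.

If the source is a taildir source, increase the number of filegroups.

Tune batchSize.

4.2 Channel

A memory channel is fast but can lose data.

transactionCapacity can be adjusted; it must be at least as large as the batchSize of the source and the sink.

4.3 Sink

Adding sinks increases the rate at which Events are consumed, but too many sinks take up a lot of system resources and hurt performance.

batchSize determines how many events the sink reads from the channel at a time; raising it moderately increases the rate at which the sink consumes Events.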
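
When tuning these values, it helps to keep their relationship in view: a batch has to fit into one channel transaction, and a transaction has to fit into the channel. The plain-Java sketch below checks that ordering for a set of candidate values. BatchSizeCheck and isConsistent are hypothetical helper names for illustration; Flume performs its own validation, and this does not reproduce it.

```java
public class BatchSizeCheck {
    /** True when batchSize <= transactionCapacity <= capacity and batchSize is positive. */
    static boolean isConsistent(int batchSize, int transactionCapacity, int capacity) {
        return batchSize > 0
                && batchSize <= transactionCapacity
                && transactionCapacity <= capacity;
    }

    public static void main(String[] args) {
        // Values used in this article: batchSize=100, transactionCapacity=100, capacity=1000
        System.out.println(isConsistent(100, 100, 1000));  // true
        // A batch larger than the transaction capacity cannot be committed in one transaction
        System.out.println(isConsistent(1000, 100, 1000)); // false
    }
}
```

In practice this means that raising batchSize usually goes together with raising transactionCapacity, and possibly capacity as well.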

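5. Flume's transaction mechanism

Answer: The hop from source to channel and the hop from channel to sink each run in their own transaction; a batch that fails is rolled back and retried.

6. Can Flume lose data while collecting?

Answer: With a durable channel, Flume generally does not lose data. Because of the transaction mechanism, data remains safe even if something goes wrong during collection: the transfer from source to channel is transactional, and so is the transfer from channel to sink. Transactions protect against data loss but can produce duplicates. For example, if the sink has sent the data successfully but does not receive the response, it sends the data again, and the destination may then hold the same data twice. With a memory channel, however, events that are in the channel when a power failure or crash occurs can still be lost.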