Using Flume

鲸落万物

License: CC 4.0 BY-SA

Tags: flume, big data

Article link: https://blog.youkuaiyun.com/m0_59651968/article/details/134073442

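This article covers Flume, a distributed service for collecting, aggregating, and moving large volumes of log data. It explains the main concepts (Agent, Source, and so on) and the workflow, walks through installation and deployment, and demonstrates usage with several examples such as monitoring a port and tailing files in real time. It closes with extensions, a stop script, and a simple two-agent chain.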


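I. Overview of Flume

1. Brief introduction

Official documentation: Flume 1.9.0 User Guide (Apache Flume)

Flume (Apache Flume) is a distributed service for collecting, aggregating, and moving large amounts of log data. It is designed to be reliable, scalable, and manageable when handling large data streams. It is typically used to collect log data from sources such as applications, web servers, and sensors into central storage or processing systems such as Hadoop and HBase.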

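2. Main features and concepts

2.1 Agent

The basic unit of work in Flume: a JVM process that collects, transforms, and forwards data. Each agent has its own configuration file, which defines its sources, channels, and sinks.

2.2 Source

The source is the entry point of an agent and is responsible for receiving data from external data sources. Flume ships with many built-in sources and also supports custom ones.

2.3 Channel

A channel is the buffer between a source and a sink, which lets the two run at different rates. Channels are thread-safe and can handle writes from several sources and reads from several sinks at the same time.

The two most commonly used built-in channels are the Memory Channel (in memory) and the File Channel (on disk).

The Memory Channel is an in-memory queue. It suits cases where losing data is acceptable. If data loss matters, do not use it, because a process crash, a machine failure, or a restart loses whatever is still in the queue.

The File Channel writes every event to disk, so events survive a process shutdown or a machine crash.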
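
As a rough sketch of what switching to the File Channel might look like, the snippet below writes a channel fragment from a shell function, in the same write-the-config-from-a-script spirit as the scripts later in this article. `type = file`, `checkpointDir`, and `dataDirs` are File Channel properties; the function name `write_file_channel_conf` and the two directory paths are illustrative assumptions, not part of the original setup.

```shell
#!/bin/bash
# Write a File Channel fragment to the file given as $1.
# The checkpoint and data directories below are hypothetical examples.
write_file_channel_conf() {
  local out="$1"
  cat > "$out" <<'EOF'
a1.channels = c1
# Persist events on disk instead of in memory
a1.channels.c1.type = file
# Where the channel keeps its checkpoint (assumed path)
a1.channels.c1.checkpointDir = /opt/module/flume-1.9.0/checkpoint
# Where the channel keeps its event data files (assumed path)
a1.channels.c1.dataDirs = /opt/module/flume-1.9.0/data
EOF
}
```

Calling `write_file_channel_conf /tmp/file-channel.conf` produces a fragment that can replace the three `memory` channel lines in any of the agent files shown here; the source and sink bindings stay the same.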

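2.4 Sink

A sink continually polls the channel for events, removes them in batches, and writes those batches to a storage or indexing system or sends them on to another Flume agent.

Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, kafka, and custom sinks.

2.5 Event

The event is Flume's basic unit of data; data travels from the source to the destination as events. An event has two parts: the Header, a key-value structure holding the event's attributes, and the Body, a byte array holding the data itself.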

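2.6 Interceptor

Interceptors let users transform or filter events in flight. They are attached to a source and modify or drop events before the events are written to the channel.

2.7 Flume NG (Next Generation)

Flume NG is the redesigned generation of Flume (the 1.x line). It brought many improvements, including flexible configuration, pluggable components, and event interceptors.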

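3. Flume workflow

1. A data source produces data and sends it to a Flume agent.
2. The agent receives the data through the source defined in its configuration file and passes it to one or more channels.
3. The channel buffers the events until a sink has taken them, which is what makes delivery reliable.
4. The sink reads events from the channel and delivers them to the destination (for example Hadoop or HBase).

Flume's configuration model and extensibility make it suitable for a wide range of data collection and transport scenarios.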

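II. Installing and deploying Flume

Flume download: Index of /dist/flume (apache.org)

# Upload the tarball and extract it
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt/module/
# Rename the directory
mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume-1.9.0
# Remove guava-11.0.2.jar from lib, which is incompatible with Hadoop
rm -f /opt/module/flume-1.9.0/lib/guava-11.0.2.jar
# Configure environment variables: append the two export lines below to /etc/profile
vim /etc/profile
export FLUME_HOME=/opt/module/flume-1.9.0
export PATH=$PATH:$FLUME_HOME/bin
# Reload the environment variables
source /etc/profile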
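
A quick sanity check after these steps can catch a mistyped path early. The helper below is only a sketch: `check_flume_home` is a hypothetical name, and all it verifies is that the `flume-ng` launcher exists and is executable under the given Flume home (the `FLUME_HOME` variable by default).

```shell
#!/bin/bash
# Verify that a Flume home directory contains an executable bin/flume-ng.
# Uses $1 if given, otherwise the FLUME_HOME environment variable.
check_flume_home() {
  local home="${1:-${FLUME_HOME:-}}"
  if [ -z "$home" ]; then
    echo "FLUME_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$home/bin/flume-ng" ]; then
    echo "flume-ng not found or not executable under $home/bin" >&2
    return 1
  fi
  echo "ok: $home/bin/flume-ng"
}
```

Once the check passes, running `flume-ng version` should print the installed Flume version.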

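III. Flume examples

1. Official example: monitoring port data

1.1 Requirement

Use Flume to listen on a port, collect the data arriving on it, and print it to the console.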

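1.2 Steps

1.2.1 Install netcat

yum install -y nc

1.2.2 Check whether port 44444 is in use

netstat -nultp | grep 44444

1.2.3 Write the agent configuration file

# First create a myconf directory under the Flume directory; all agent configuration files go there
mkdir -p /opt/module/flume-1.9.0/myconf
# Create the agent configuration file
vim /opt/module/flume-1.9.0/myconf/flume-netcat-logger.conf
# Name the components on this agent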
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

1.2.4 Start the Flume agent

flume-ng agent --conf conf --name a1 --conf-file ./myconf/flume-netcat-logger.conf  -Dflume.root.logger=INFO,console

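flume-ng agent: starts a Flume agent.
--conf conf: the configuration directory, which holds flume-env.sh and the logging settings. The relative path here assumes the command is run from the Flume installation directory.
--name a1: the agent's name. It must match the prefix used in the configuration file (a1 here), since that prefix identifies which agent definition to load.
--conf-file ./myconf/flume-netcat-logger.conf: path to the agent configuration file, here flume-netcat-logger.conf in the myconf directory under the current directory.
-Dflume.root.logger=INFO,console: sets the log level to INFO and sends log output to the console.

1.2.5 Use netcat to send data to port 44444 on the local machine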

nc localhost 44444

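1.2.6 Check the output on the Flume agent side

Each line typed into nc shows up in the agent's console, where the logger sink logs it as an Event.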

2. exec: tailing a single appended file in real time

2.1 Requirement

Monitor the Hive log in real time and upload it to HDFS.

2.2 Steps

2.2.1 Check the environment

To write to HDFS, Flume depends on the Hadoop jars.

# Check /etc/profile and confirm the Hadoop and Java environment variables are set correctly
export JAVA_HOME=/opt/module/jdk1.8.0_211
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/opt/module/hadoop-3.3.1
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

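2.2.2 Write the agent configuration file

Note: reading a file on a Linux system means running a Linux command, so the source type chosen for the Hive log is exec (short for execute): the source runs a Linux command and reads its output. An exec source can follow a file but cannot monitor a directory.

vim flume-file-hdfs2.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive-3.1.2/logs/hive.log

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.48.10:9000/flume/%Y-%m-%d/
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to round the timestamp down (controls time-based directory rolling)
a1.sinks.k1.hdfs.round = true
# Round down to a multiple of this many time units
a1.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Use the local time instead of a timestamp from the event header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes plain, uncompressed output (CompressedStream enables compression)
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 30 seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll when the file reaches about 1024 bytes
a1.sinks.k1.hdfs.rollSize = 1024
# 0 = never roll based on the number of events
a1.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1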

2.2.3 Write a Flume launch script

# Create flume.sh in Flume's bin directory with the following content

#!/bin/bash
/opt/module/flume-1.9.0/bin/flume-ng agent --conf /opt/module/flume-1.9.0/conf --name a1 --conf-file "/opt/module/flume-1.9.0/myconf/$1" -Dflume.root.logger=INFO,console

# Make the script executable
chmod 755 flume.sh

2.2.4 Run the script

# Run the script
flume.sh flume-file-hdfs2.conf
# Append data to hive.log or start Hive, then check the HDFS web UI
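
The launch script above starts Flume even when the file name is missing or mistyped, and the error only surfaces later in the agent log. A slightly more defensive variant could look like the sketch below. The function name `run_agent` and its return codes are my own choices, and it assumes the same layout as above (configuration files under `myconf`, agent named `a1`).

```shell
#!/bin/bash
# Start agent a1 with a config file from $FLUME_HOME/myconf.
# Returns 2 when no file name is given and 1 when the file does not exist.
run_agent() {
  local flume_home="${FLUME_HOME:-/opt/module/flume-1.9.0}"
  local name="${1:-}"
  if [ -z "$name" ]; then
    echo "usage: run_agent <config-file-name>" >&2
    return 2
  fi
  local conf_file="$flume_home/myconf/$name"
  if [ ! -f "$conf_file" ]; then
    echo "config file not found: $conf_file" >&2
    return 1
  fi
  "$flume_home/bin/flume-ng" agent \
    --conf "$flume_home/conf" \
    --name a1 \
    --conf-file "$conf_file" \
    -Dflume.root.logger=INFO,console
}
```

To use it as a drop-in replacement for flume.sh, add a final line `run_agent "$@"` to the script.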

3. spooldir: watching a directory for new files

3.1 Requirement

Use Flume to watch an entire directory and upload its files to HDFS.

3.2 Steps

3.2.1 Write the agent configuration file

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/module/testData/spoolDir
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.fileHeader = true
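# Ignore (do not upload) any file ending in .tmp
a1.sources.r1.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.48.10:9000/flume/spoolDir/%Y-%m-%d/
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = upload-
# Whether to round the timestamp down (controls time-based directory rolling)
a1.sinks.k1.hdfs.round = true
# Round down to a multiple of this many time units
a1.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Use the local time instead of a timestamp from the event header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes plain, uncompressed output (CompressedStream enables compression)
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 60 seconds
a1.sinks.k1.hdfs.rollInterval = 60
# Roll when the file reaches about 1024 bytes
a1.sinks.k1.hdfs.rollSize = 1024
# 0 = never roll based on the number of events
a1.sinks.k1.hdfs.rollCount = 0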

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

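3.2.2 Start the Flume agent

# 1. Start the agent with the launch script
flume.sh flume-spooldir-hdfs.conf
# 2. Create a file containing some data under /opt/module/testData/spoolDir
# 3. Check HDFS

Note: spooldir only picks up new files placed in the directory. It does not follow files that keep being appended to, so a file should be complete before it lands there.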
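
One way to make sure the source never sees a half-written file is to copy it in under a `.tmp` name, which the `ignorePattern` above skips, and rename it once the copy is complete. The helper below is a sketch of that idea; `spool_put` is a hypothetical name, not a Flume tool.

```shell
#!/bin/bash
# Copy file $1 into spooling directory $2 under a temporary ".tmp" name,
# then rename it so the source only ever sees a complete file.
spool_put() {
  local src="$1" dir="$2"
  local name
  name="$(basename "$src")"
  # The rename happens inside one directory, so it is a single atomic step
  cp "$src" "$dir/$name.tmp" && mv "$dir/$name.tmp" "$dir/$name"
}
```

For example, `spool_put /tmp/app.log /opt/module/testData/spoolDir` (the source path is illustrative). Each delivered file also needs a name that has not been used in that directory before, because the Spooling Directory source treats a reused file name as an error.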

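4. Taildir: the final version of log collection into the HDFS cluster

Taildir: available since Flume 1.7

The Exec source suits tailing a single appended file but cannot resume from where it stopped. The Spooldir source suits syncing new files but not following logs that are appended to in real time. The Taildir source can follow multiple appended files at once and can resume from its last recorded position.

Advantages:

It can monitor multiple directories at the same time.
It maintains a position index (the indexes for different directories are independent of each other).
After a restart it resumes from the recorded position, so lines written while the agent was down are not skipped.

4.1 Requirement

Use Flume to follow the appended files in whole directories in real time and upload them to HDFS.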

4.2 Steps

4.2.1 Write the agent configuration file

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
# Location of the position (index) file
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/index/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /opt/module/testData/tailDir/topic_event/2019-06-09/.*log.*
a1.sources.r1.filegroups.f2 = /opt/module/testData/tailDir/topic_start/2019-06-09/.*log.*

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.48.10:9000/flume/tailDir/2019-06-09
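# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = tailDir-
# Whether to round the timestamp down (controls time-based directory rolling)
a1.sinks.k1.hdfs.round = true
# Round down to a multiple of this many time units
a1.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Use the local time instead of a timestamp from the event header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes plain, uncompressed output (CompressedStream enables compression)
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 30 seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll when the file reaches about 100 KB
a1.sinks.k1.hdfs.rollSize = 102400
# 0 = never roll based on the number of events
a1.sinks.k1.hdfs.rollCount = 0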

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

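4.2.2 Run the script

# 1. Start the Flume agent
flume.sh flume-taildir-hdfs.conf
# 2. Append to an existing file, or create a new file and add content
# 3. Check HDFS
# 4. The position index can also be inspected at /opt/module/flume-1.9.0/index/taildir_position.json

IV. Extending the tailDir-hdfs setup

1. Making the date dynamic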

#!/bin/bash

echo "
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/index/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /opt/module/testData/tailDir/topic_event/`date -d '-1 day' +%F`/.*log.*
a1.sources.r1.filegroups.f2 = /opt/module/testData/tailDir/topic_start/`date -d '-1 day' +%F`/.*log.*

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.48.10:9000/flume/tailDir/`date -d '-1 day' +%F`
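# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = tailDir-
# Whether to round the timestamp down (controls time-based directory rolling)
a1.sinks.k1.hdfs.round = true
# Round down to a multiple of this many time units
a1.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Use the local time instead of a timestamp from the event header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes plain, uncompressed output (CompressedStream enables compression)
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 30 seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll when the file reaches about 100 KB
a1.sinks.k1.hdfs.rollSize = 102400
# 0 = never roll based on the number of events
a1.sinks.k1.hdfs.rollCount = 0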

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
" > /opt/module/flume-1.9.0/myconf/flume-taildir-hdfs.conf

/opt/module/flume-1.9.0/bin/flume-ng agent --conf /opt/module/flume-1.9.0/conf --name a1 --conf-file /opt/module/flume-1.9.0/myconf/flume-taildir-hdfs.conf -Dflume.root.logger=INFO,console

2. Separating data by purpose

Interceptors

Run the script as tailDir-hdfs.sh 2019-06-09. If the date argument is omitted, it defaults to the previous day.

#!/bin/bash

if [ -n "$1" ]
then
do_date=$1
else
do_date=`date -d '-1 day' +%F`
fi

echo "
a1.sources = r1 r2
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
# Interceptor: stamp every event from r1 with the header type=topic_event
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = topic_event

a1.sources.r1.positionFile = /opt/module/flume-1.9.0/index/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/testData/tailDir/topic_event/$do_date/.*log.*

# Describe/configure the source
a1.sources.r2.type = TAILDIR
# Interceptor: stamp every event from r2 with the header type=topic_start
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = topic_start

# Separate position file for r2; two Taildir sources sharing one file would overwrite the offsets of the other
a1.sources.r2.positionFile = /opt/module/flume-1.9.0/index/taildir_position_r2.json
a1.sources.r2.filegroups = f2
a1.sources.r2.filegroups.f2 = /opt/module/testData/tailDir/topic_start/$do_date/.*log.*


# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.48.10:9000/flume/tailDir/%{type}/$do_date
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = tailDir-
# Whether to round the timestamp down (controls time-based directory rolling)
a1.sinks.k1.hdfs.round = true
# Round down to a multiple of this many time units
a1.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Use the local time instead of a timestamp from the event header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes plain, uncompressed output (CompressedStream enables compression)
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 30 seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll when the file reaches about 100 KB
a1.sinks.k1.hdfs.rollSize = 102400
# 0 = never roll based on the number of events
a1.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sinks.k1.channel = c1
" > /opt/module/flume-1.9.0/myconf/flume-taildir-hdfs.conf

/opt/module/flume-1.9.0/bin/flume-ng agent --conf /opt/module/flume-1.9.0/conf --name a1 --conf-file /opt/module/flume-1.9.0/myconf/flume-taildir-hdfs.conf -Dflume.root.logger=INFO,console
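
In the script above, the static interceptor stamps each event with a `type` header, and the sink replaces `%{type}` in `hdfs.path` with that header's value, which is what routes the two sources into separate directories. The function below is only a rough shell model of that substitution, to make the resulting paths concrete; it is not how Flume implements it, and `expand_type_header` is a hypothetical name.

```shell
#!/bin/bash
# Replace every %{type} in path template $1 with header value $2.
# Illustration only: $2 must not contain "|" or "&", which sed treats specially.
expand_type_header() {
  local template="$1" type_value="$2"
  printf '%s\n' "$template" | sed "s|%{type}|$type_value|g"
}
```

With the template `hdfs://192.168.48.10:9000/flume/tailDir/%{type}/2019-06-09`, an event from r1 (header value `topic_event`) resolves to `.../tailDir/topic_event/2019-06-09`, and an event from r2 to `.../tailDir/topic_start/2019-06-09`.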

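Settings such as a1.sinks.k1.hdfs.rollInterval = 30 and a1.sinks.k1.hdfs.rollSize = 102400 are independent triggers: as soon as any one of them is reached, the sink rolls the current file. Until then the file stays a temporary file (with a .tmp suffix by default) that is still waiting for data.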
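
The "whichever comes first" rule can be written down as a small decision function. This is a simplified model of the behavior described here, not Flume's code; `should_roll` and its argument order are assumptions made for illustration. A threshold of 0 disables that trigger, matching `rollCount = 0` in the configurations above.

```shell
#!/bin/bash
# should_roll ELAPSED_SEC BYTES EVENTS ROLL_INTERVAL ROLL_SIZE ROLL_COUNT
# Succeeds (returns 0) if any enabled threshold has been reached.
# A threshold of 0 means that trigger is disabled.
should_roll() {
  local elapsed="$1" bytes="$2" events="$3"
  local roll_interval="$4" roll_size="$5" roll_count="$6"
  if [ "$roll_interval" -gt 0 ] && [ "$elapsed" -ge "$roll_interval" ]; then
    return 0
  fi
  if [ "$roll_size" -gt 0 ] && [ "$bytes" -ge "$roll_size" ]; then
    return 0
  fi
  if [ "$roll_count" -gt 0 ] && [ "$events" -ge "$roll_count" ]; then
    return 0
  fi
  return 1
}
```

With the settings 30 / 102400 / 0, `should_roll 30 500 10 30 102400 0` succeeds because of the time trigger, `should_roll 5 102400 10 30 102400 0` succeeds because of the size trigger, and `should_roll 5 500 99999 30 102400 0` fails because the event-count trigger is disabled.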

V. Writing a script to stop Flume

#!/bin/bash
ps -ef | grep flume | grep -v grep | awk -F " " '{print $2}' | xargs kill -9
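
The one-liner above matches any process whose command line contains "flume" (including an editor that has a Flume config open) and kills it with SIGKILL, which gives the agent no chance to shut down cleanly. A more selective sketch is shown below. It assumes the agent JVM was started by `flume-ng agent`, whose main class is `org.apache.flume.node.Application`; the function names are hypothetical.

```shell
#!/bin/bash
# Read `ps -ef` style text on stdin and print the PID (second column)
# of every Flume agent JVM.
flume_pids() {
  awk '/org\.apache\.flume\.node\.Application/ && !/awk/ {print $2}'
}

# Ask all running Flume agents to stop.
stop_flume() {
  local pids
  pids="$(ps -ef | flume_pids)"
  if [ -z "$pids" ]; then
    echo "no Flume agent running"
    return 0
  fi
  # Plain kill sends SIGTERM, so the agent can stop its components first
  kill $pids
}
```

Calling `stop_flume` sends SIGTERM to every agent it finds; `kill -9` can remain a last resort for an agent that does not exit.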

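VI. A simple chain of two Flume agents

Flow: data typed into nc on node01 is sent by node01's Flume agent to node02, whose agent prints it to the console. Start the agent on node02 first, then the agent on node01, then nc.

Agent configuration on node01, whose avro sink points at node02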

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.48.20
a1.sinks.k1.port = 4545

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Agent configuration on node02, whose avro source receives from node01

# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.48.20
a1.sources.r1.port = 4545

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Startup

# 1. Start the agent on node02
/opt/module/flume-1.9.0/bin/flume-ng agent --conf /opt/module/flume-1.9.0/conf --name a1 --conf-file /opt/module/flume-1.9.0/myconf/flume_node01_node02.conf -Dflume.root.logger=INFO,console

# 2. Start the agent on node01
/opt/module/flume-1.9.0/bin/flume-ng agent --conf /opt/module/flume-1.9.0/conf --name a1 --conf-file /opt/module/flume-1.9.0/myconf/flume_to_node02.conf -Dflume.root.logger=INFO,console

# 3. Start nc on node01
nc localhost 44444