Flume + Hadoop Big Data Collection and Deployment
Introduction
In big data processing, collecting log data is the first step of any analysis. Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large volumes of log data into a centralized data store. This article walks through using Flume to collect log data and upload it to the Hadoop Distributed File System (HDFS).
Flume Overview
Apache Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data. Built on a streaming architecture, it offers flexibility and simplicity: it can read data from a server's local disk in real time and write it to HDFS.
System Requirements
- Hadoop 2.8.0
  Baidu Netdisk link: https://pan.baidu.com/s/16VZGWk4kdiJ6GYxDP5BUew (extraction code: j9fa)
- Flume 1.9.0
  Baidu Netdisk link: https://pan.baidu.com/s/1eLLKeQWaMvPjSJziEewfVA (extraction code: 3q2s)
- CentOS 7
Flume Configuration Structure
A Flume configuration file defines where the data comes from and where it goes. The configuration below defines a simple Flume agent that tails a local log file and writes the collected events to HDFS.
As Flume's architecture shows, an agent consists of three main parts: a source, a channel, and a sink.
Configuring Flume
Create a file named flume-hdfs.conf in the /opt/server/flume/conf directory:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/flume-test.log # continuously read new lines appended to this log file
# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://zhang:9000/flume/logs/ # "zhang" is this machine's hostname (run `hostname` to check yours); the flume/logs/ directory is created automatically, no need to make it by hand
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and the sink to the channel
a1.sources.r1.channels = c1 # note the plural "channels" for sources
a1.sinks.k1.channel = c1 # singular "channel" for sinks
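By default, the HDFS sink writes SequenceFiles and rolls a new file every 30 seconds, every 1024 bytes, or every 10 events, whichever comes first, which tends to litter HDFS with many tiny files. The optional properties below (illustrative values, not part of the original setup) switch the sink to plain-text output and roll purely on a time interval:

```
# Optional HDFS sink tuning, appended to flume-hdfs.conf (values are illustrative)
a1.sinks.k1.hdfs.fileType = DataStream    # write plain text instead of SequenceFile
a1.sinks.k1.hdfs.rollInterval = 60       # start a new file every 60 seconds
a1.sinks.k1.hdfs.rollSize = 0            # disable size-based rolling
a1.sinks.k1.hdfs.rollCount = 0           # disable event-count-based rolling
```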
Start Hadoop HDFS
start-dfs.sh
If HDFS started up normally, the web UI opens as shown below.
You can also verify from the command line with hdfs dfs -ls /:
[root@localhost 192 conf]# hdfs dfs -ls /
Found 2 items
drwxr-xr-x - root supergroup 0 2024-07-13 08:09 /flume
-rw-r--r-- 1 root supergroup 0 2024-07-12 19:53 /test.txt
If the files can be listed, HDFS is working correctly.
Start Flume
From the directory containing flume-hdfs.conf, run the startup command:
[root@localhost 192 conf]# flume-ng agent --conf ./ --conf-file flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,console
Info: Sourcing environment configuration script /opt/server/flume/conf/flume-env.sh
Info: Including Hadoop libraries found via (/opt/server/hadoop-2.8.0/bin/hadoop) for HDFS access
Info: Including Hive libraries found via () for Hive access
+ exec /usr/lib/jvm/jdk1.8.0_65/bin/java -Xmx20m -Dflume.root.logger=INFO,console -cp '/opt/server/flume/conf:/opt/server/flume/lib/*:/opt/server/hadoop-2.8.0/etc/hadoop:/opt/server/hadoop-2.8.0/share/hadoop/common/lib/*:/opt/server/hadoop-2.8.0/share/hadoop/common/*:/opt/server/hadoop-2.8.0/share/hadoop/hdfs:/opt/server/hadoop-2.8.0/share/hadoop/hdfs/lib/*:/opt/server/hadoop-2.8.0/share/hadoop/hdfs/*:/opt/server/hadoop-2.8.0/share/hadoop/yarn/lib/*:/opt/server/hadoop-2.8.0/share/hadoop/yarn/*:/opt/server/hadoop-2.8.0/share/hadoop/mapreduce/lib/*:/opt/server/hadoop-2.8.0/share/hadoop/mapreduce/*:/opt/server/hadoop-2.8.0/contrib/capacity-scheduler/*.jar:/lib/*' -Djava.library.path=:/opt/server/hadoop-2.8.0/lib/native org.apache.flume.node.Application --conf-file flume-hdfs.conf --name a1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/server/flume/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/server/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2024-07-14 05:36:25,840 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:62)] Configuration provider starting
2024-07-14 05:36:25,850 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:138)] Reloading configuration file:flume-hdfs.conf
2024-07-14 05:36:25,861 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:c1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:r1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:r1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:r1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1117)] Added sinks: k1 Agent: a1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig(FlumeConfiguration.java:1203)] Processing:k1
2024-07-14 05:36:25,862 (conf-file-poller-0) [INFO
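With the agent running, the pipeline can be exercised end to end by appending lines to the monitored log. A minimal sketch; /tmp/flume-test.log is a stand-in path so it runs without root privileges, substitute the /var/log/flume-test.log path from the config above:

```shell
# Append a test event to the monitored log file.
# /tmp/flume-test.log is a stand-in; the config above tails /var/log/flume-test.log.
LOG=/tmp/flume-test.log
echo "hello flume" >> "$LOG"
# Show the last line -- this is what the exec source's `tail -F` picks up
tail -n 1 "$LOG"
```

Once the channel flushes, the events should appear under /flume/logs/ in HDFS and can be inspected with `hdfs dfs -cat /flume/logs/*` (by default the sink names its files with a FlumeData prefix).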