Data Collection with Flume

Original article, last modified 2022-02-22 16:19:43


Tags:

#flume #big data

First published 2022-02-22 16:03:09

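This article introduces the basic concepts of Flume, including its role as a real-time data collection tool for big data, and the steps for collecting data with it, such as writing the configuration file and starting the agent. It uses netcat to test port communication and confirm that Flume is running correctly.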


Contents

Preface
I. What is Flume?
II. Usage steps

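Preface

I did not study Flume seriously when I was in school, so I am making up for it now. Truly, in my youth I did not know how good Flume was, and mistook Wu Han (吴汉) for the treasure!

Note: the body of the article follows; the examples below are for reference.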

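I. What is Flume?

Informal understanding: the English word "flume" means a water channel or sluice. Think of the water as data, and Flume is then a tool for channeling data in.
Formal definition: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.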

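II. Usage steps

1. Flume architecture

Data source (logs) -> agent -> HDFS

An agent consists of three components: source, channel, and sink. The source receives the data, the channel buffers it, and the sink writes it to the destination. The agent itself is a JVM process that moves the data along.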
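To make the source, channel, sink flow concrete, here is a minimal Python sketch. It is a toy model only, not Flume's implementation: the class `MemoryChannel` and the functions `run_source` and `run_sink` are hypothetical names chosen for illustration, and the "destination" is just a Python list standing in for HDFS.

```python
import queue


class MemoryChannel:
    """Toy stand-in for a Flume channel: a bounded in-memory buffer."""

    def __init__(self, capacity):
        self._events = queue.Queue(maxsize=capacity)

    def put(self, event):
        # Raises queue.Full when the buffer is already at capacity.
        self._events.put_nowait(event)

    def take(self):
        # Returns None when the buffer is empty.
        try:
            return self._events.get_nowait()
        except queue.Empty:
            return None


def run_source(lines, channel):
    """Source: turn each input line into an event and put it in the channel."""
    for line in lines:
        channel.put(line.rstrip("\n"))


def run_sink(channel, destination):
    """Sink: drain the channel and deliver each event to the destination."""
    event = channel.take()
    while event is not None:
        destination.append(event)
        event = channel.take()


channel = MemoryChannel(capacity=1000)
delivered = []  # hypothetical destination standing in for HDFS
run_source(["hello\n", "world\n"], channel)
run_sink(channel, delivered)
print(delivered)  # ['hello', 'world']
```

The point of the buffer is that the source and the sink do not have to run at the same speed: the source can keep writing until the channel is full, and the sink drains it at its own pace.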

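2. Install netcat, a lightweight communication tool (for testing)

Install it: sudo yum install -y nc
Check whether the port is already in use: sudo netstat -nlp | grep 9999
Start the server: nc -lk 9999
Start the client in another terminal: nc localhost 9999
Type hello on the server side, and the client shows that it received hello.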
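If netstat is not available, the port check can also be approximated from Python. This is only a sketch using the standard `socket` module: the helper name `port_in_use` is hypothetical, and it only detects TCP listeners reachable on the given address (127.0.0.1 is assumed here).

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


# The result depends on what is running on this machine.
print("port 9999 in use:", port_in_use(9999))
```

Run it once before starting nc -lk 9999 and once after: the answer should change from False to True.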

3. Configure the Flume job file

$ mkdir job
$ cd job
$ touch net-flume-logger.conf

Write the following into it:

# net-flume-logger.conf: a single-node Flume configuration
# Name the components on this agent; a1 is the name of the agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source: the address and port to listen on
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink: the output type (logger writes events to the log)
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory, and set its capacity
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
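A common mistake in this kind of file is a source or sink that points at a channel that was never declared. The following Python sketch checks the wiring of the configuration above before starting the agent. It is not a Flume tool: `parse_properties` and `check_wiring` are hypothetical helpers, the parser handles only simple `key = value` lines, and the rule that transactionCapacity should not exceed capacity is stated here as my understanding of the memory channel rather than taken from the article.

```python
CONFIG = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""


def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comment lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems in the agent's source/channel/sink wiring."""
    problems = []
    declared = set(props.get(f"{agent}.channels", "").split())
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not used:
            problems.append(f"source {source} is not bound to any channel")
        for name in used:
            if name not in declared:
                problems.append(f"source {source} uses undeclared channel {name}")
    for sink in props.get(f"{agent}.sinks", "").split():
        name = props.get(f"{agent}.sinks.{sink}.channel")
        if name is None:
            problems.append(f"sink {sink} is not bound to a channel")
        elif name not in declared:
            problems.append(f"sink {sink} uses undeclared channel {name}")
    for name in sorted(declared):
        capacity = props.get(f"{agent}.channels.{name}.capacity")
        batch = props.get(f"{agent}.channels.{name}.transactionCapacity")
        # Assumed rule: a transaction cannot hold more events than the channel.
        if capacity and batch and int(batch) > int(capacity):
            problems.append(f"channel {name}: transactionCapacity exceeds capacity")
    return problems


problems = check_wiring(parse_properties(CONFIG), "a1")
print(problems or "wiring looks consistent")  # wiring looks consistent
```

Note the asymmetry the check relies on: a source uses the plural key `channels` (it may write to several), while a sink uses the singular key `channel` (it reads from exactly one).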

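4. Start Flume

$ bin/flume-ng agent -n a1 -c conf/ -f job/net-flume-logger.conf -Dflume.root.logger=INFO,console

Here -n is the agent name (it must match the a1 prefix used in the configuration file), -c is Flume's configuration directory, and -f is the job file written in step 3. The agent listens on the port set in that file (44444), so test it from another terminal with nc localhost 44444: each line typed there is printed on the agent's console by the logger sink.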
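Once the agent is running, a test event can also be sent from a script instead of typing into nc. The sketch below uses only the standard `socket` module; `send_event` is a hypothetical helper, and the host and port defaults are taken from the configuration in step 3. As far as I know, the netcat source acknowledges each line with OK by default, but treat the exact reply as an assumption and check it against your Flume version.

```python
import socket


def send_event(line, host="localhost", port=44444, timeout=5.0):
    """Send one newline-terminated line and return the server's reply, stripped."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("utf-8") + b"\n")
        return conn.recv(1024).decode("utf-8").strip()


# Requires the agent from step 4 to be running; uncomment to try it:
# print(send_event("hello"))
```

If the call succeeds, the agent's console should show a logger line containing the body of the hello event.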

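5. Monitor a file

First modify the configuration file:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
# The file to monitor. Keep comments on their own line: in a properties file,
# a # after a value becomes part of the value.
a1.sources.r1.command = tail -F /var/log/secure.log
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = hdfs
# Use the NameNode RPC port from fs.defaultFS (8020 is assumed here);
# 50070 is the NameNode web UI port in Hadoop 2.x and does not accept hdfs:// requests.
a1.sinks.k1.hdfs.path = hdfs://hadoop:8020/flume/%Y%m%d/%H
# The time escapes in the path need a timestamp; take it from the local clock.
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1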
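The %Y%m%d/%H part of the HDFS path is filled in per event from a timestamp, so events end up in one directory per hour. The Python sketch below illustrates the idea for these four escapes only, which happen to coincide with `strftime` codes; Flume has its own escape table, so this is not a general emulation. `expand_path` is a hypothetical helper, the timestamp is assumed to be in epoch milliseconds, and UTC is used so the result is reproducible (Flume itself uses the local time zone unless configured otherwise).

```python
from datetime import datetime, timezone


def expand_path(template, timestamp_ms, tz=timezone.utc):
    """Expand strftime-style escapes using a timestamp in epoch milliseconds."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=tz)
    return moment.strftime(template)


# 1645517983000 ms is 2022-02-22 08:19:43 UTC.
print(expand_path("/flume/%Y%m%d/%H", 1645517983000))  # /flume/20220222/08
```

Because the path depends on a timestamp, the sink needs one for every event: either a timestamp header added to the event, or the hdfs.useLocalTimeStamp = true setting on the sink so that the agent's own clock is used.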