Installing and Using Flume
Flume Overview
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not limited to log data aggregation. Since its data sources are customizable, Flume can be used to transport large volumes of event data, including but not limited to network traffic data, data generated by social media, email messages, and almost any other possible data source.
Flume Architecture
Data Processing Model
A Flume deployment consists of one or more agents. Each agent is an independent daemon process (a JVM) that receives data from clients or from other agents and then quickly forwards it to the next destination, which may be a sink or another agent.
Source
The source is responsible for producing or collecting data. It usually receives data from an RPC-based program or from the sink of another Flume node, and passes the received data on to one or more channels in the form of Flume events. Flume provides many ways to receive data, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
Channel
The channel is a transient staging area responsible for buffering and persisting data; it can persist to jdbc, file, or memory. It caches the events received from the source until they are consumed by sinks, so a channel can be thought of as a first-in-first-out queue. Flume focuses on data transport, so there is almost no parsing or pre-processing: data is produced, wrapped into events, and transported. An event is removed from the current channel only after it has been stored in the next location, which may be the final store (such as HDFS) or the channel of the next Flume node. This hand-off is controlled by transactions, which guarantees the reliability of the data.
Sink
The sink is responsible for forwarding data to a centralized store such as HBase or HDFS. It consumes events from the channel and delivers them to the destination, which may be the next Flume agent in the flow, or hdfs, logger, avro, thrift, ipc, file, null, HBase, solr, a custom sink, and so on.
Event
An event is the most basic unit of data transferred inside Flume. It consists of a byte array carrying the payload (the data enters at the source, is transported through the agent, and ends up at the destination such as HDFS/HBase) plus an optional set of headers.
A typical Flume event has the structure shown below:
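A minimal sketch of the two parts of an event (the header keys and values here are illustrative examples, not fields required by Flume):
headers (optional key/value map) : { "timestamp" : "1566538123456", "host" : "node01" }
body    (byte array payload)     : "1,alex,18,xian"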
When writing a custom plugin, for example a flume-hbase-sink plugin, what you receive is the event; you parse it, filter it as needed, and then forward it to HBase or HDFS.
Flume uses a transactional approach to guarantee reliable delivery of events. Sources and sinks encapsulate, in a transaction, the storage and retrieval of events placed in or provided by a transaction of the channel. This ensures that the set of events is reliably passed from point to point in the flow. For a multi-hop flow, the sink of the previous hop and the source of the next hop both run their own transactions to make sure the data is safely stored in the channel of the next hop.
Events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel backed by the local file system. There is also a memory channel, which simply stores events in an in-memory queue; it is faster, but any events still left in the memory channel when the agent process dies cannot be recovered.
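For comparison with the memory channels used in the examples below, a durable file channel only needs a couple of extra properties; a minimal sketch (the agent/channel names and directories are illustrative):
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /root/flume_channel/checkpoint
a1.channels.c1.dataDirs = /root/flume_channel/data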
Simple Example
Generate events and print them to the console.
1. Create a file named example.conf with the following contents:
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2. Run Flume with the following command (the logger flag makes the logger sink print to the console):
flume-ng agent -n a1 -c . -f example.conf -Dflume.root.logger=INFO,console
3. Communicate with port 44444 on the local machine
yum -y install telnet
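With telnet installed, connect to the netcat source and type a few lines; each line should show up as an event in the agent's console output:
telnet localhost 44444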
Syntax for defining an agent
# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>
# properties for sources
<Agent>.sources.<Source>.<someProperty> = <someValue>
# properties for channels
<Agent>.channels.<Channel>.<someProperty> = <someValue>
# properties for sinks
<Agent>.sinks.<Sink>.<someProperty> = <someValue>
# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...
# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>
Flume Examples
1. Listen on a port and print the incoming messages to the console
Write the agent definition file listenPort.conf:
# list the sources, sinks and channels for the agent
agent1.sources = sour1
agent1.sinks = sink1
agent1.channels = chan1

# properties for sources
agent1.sources.sour1.type = netcat
# hostname or IP address to bind to
agent1.sources.sour1.bind = 192.168.142.92
# port to bind to
agent1.sources.sour1.port = 6666

# properties for channels
agent1.channels.chan1.type = memory
# maximum number of events stored in the channel
agent1.channels.chan1.capacity = 20000
# maximum number of events the channel takes from a source or gives to a sink per transaction
agent1.channels.chan1.transactionCapacity = 20000

# properties for sinks
agent1.sinks.sink1.type = logger

# set channel for source
agent1.sources.sour1.channels = chan1

# set channel for sink
agent1.sinks.sink1.channel = chan1
Start agent1 to listen for events on port 6666:
flume-ng agent -n agent1 -c . -f listenPort.conf -Dflume.root.logger=INFO,console
Communicate with port 6666:
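For example, from a machine that can reach the agent (this assumes telnet is installed, as in the previous example):
telnet 192.168.142.92 6666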
2. Monitor a file and load its contents into a Hive table
(1) Copy the Hive JAR files that Flume needs into the lib directory of the Flume installation;
(2) In Hive, create the database and a bucketed table with transactions enabled
create database flume_test;
use flume_test;
create table stus(id int,name string,age int,addr string)
CLUSTERED BY (id) INTO 2 BUCKETS
row format delimited fields terminated by ','
STORED AS ORC
TBLPROPERTIES("transactional"="true");
-- enable bucketing and transaction support
set hive.enforce.bucketing=true;
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
Note that once a table has been created as a transactional table, it cannot be changed back to a non-transactional one. You also need to set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager in the session before you can insert into, update, delete from, or query the transactional table.
(3) Write the agent configuration file flumehive.conf
# list the sources, sinks and channels for the agent
agent2.sources = sour2
agent2.sinks = sink2
agent2.channels = chan2
# properties for sources
agent2.sources.sour2.type = exec
agent2.sources.sour2.command=tail -f /root/flume_test/data.txt
# properties for channels
agent2.channels.chan2.type = memory
agent2.channels.chan2.capacity = 20000
agent2.channels.chan2.transactionCapacity = 20000
# properties for sinks
agent2.sinks.sink2.type = hive
agent2.sinks.sink2.hive.metastore = thrift://192.168.142.92:9083
agent2.sinks.sink2.hive.database = flume_test
agent2.sinks.sink2.hive.table = stus
agent2.sinks.sink2.useLocalTimeStamp = false
agent2.sinks.sink2.serializer = DELIMITED
agent2.sinks.sink2.serializer.delimiter = ","
agent2.sinks.sink2.serializer.serdeSeparator = '\t'
agent2.sinks.sink2.serializer.fieldnames =id,name,age,addr
# set channel for source
agent2.sources.sour2.channels = chan2
# set channel for sink
agent2.sinks.sink2.channel = chan2
(4) Start the agent
[root@pseudo01 flume_test]# pwd
/root/flume_test
[root@pseudo01 flume_test]# flume-ng agent -n agent2 -c . -f flumehive.conf
(5) Append data to data.txt
[root@pseudo01 flume_test]# pwd
/root/flume_test
[root@pseudo01 flume_test]# echo "1,alex,18,xian" >> data.txt
[root@pseudo01 flume_test]# echo "2,alan,22,shanghai" >> data.txt
[root@pseudo01 flume_test]# echo "3,john,25,beijing" >> data.txt
[root@pseudo01 flume_test]# echo "4,lucy,20,nanjing" >> data.txt
[root@pseudo01 flume_test]# echo "5,nancy,22,tianjing" >> data.txt
[root@pseudo01 flume_test]# echo "6,bob,21,chongqing" >> data.txt
(6) Check whether the data has been written into the Hive stus table
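For example, query the table from the Hive CLI (this assumes the session settings from step (2) are still in effect):
select * from flume_test.stus;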
3. Monitor a directory and load the contents of its files into HDFS
(1) Write the agent definition file dir_hdfs.conf with the following contents:
# list the sources, sinks and channels for the agent
agent3.sources = sour3
agent3.sinks = sink3
agent3.channels = chan3
# properties for sources
agent3.sources.sour3.type = spooldir
agent3.sources.sour3.spoolDir = /root/spoolDir
agent3.sources.sour3.fileSuffix=.finish
agent3.sources.sour3.fileHeader=false
# properties for channels
agent3.channels.chan3.type = memory
agent3.channels.chan3.capacity = 20000
agent3.channels.chan3.transactionCapacity = 20000
# properties for sinks
agent3.sinks.sink3.type = hdfs
agent3.sinks.sink3.hdfs.path = hdfs://pseudo01:9000/flume_test/events/%y-%m-%d/%H-%M
agent3.sinks.sink3.hdfs.filePrefix = pool
agent3.sinks.sink3.hdfs.round = true
agent3.sinks.sink3.hdfs.roundValue = 1
agent3.sinks.sink3.hdfs.roundUnit = minute
agent3.sinks.sink3.hdfs.fileType = DataStream
agent3.sinks.sink3.hdfs.useLocalTimeStamp = true
agent3.sinks.sink3.hdfs.callTimeout = 600
agent3.sinks.sink3.hdfs.writeFormat = Text
agent3.sinks.sink3.hdfs.rollInterval = 30
agent3.sinks.sink3.hdfs.inUseSuffix = .tmp
# set channel for source
agent3.sources.sour3.channels = chan3
# set channel for sink
agent3.sinks.sink3.channel = chan3
(2) Run Flume
flume-ng agent -n agent3 -c . -f dir_hdfs.conf -Dflume.root.logger=INFO,console
(3) Put files into the /root/spoolDir directory
[root@pseudo01 ~]# mv ccc.txt spoolDir/
[root@pseudo01 ~]# mv ddd.txt spoolDir/
[root@pseudo01 ~]# mv eee.txt spoolDir/
(4) Check the results
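For example, list and view the generated files with the HDFS command-line client (the <yy-MM-dd>/<HH-mm> placeholders stand for the actual date/time subdirectories created when the events were written):
hdfs dfs -ls -R /flume_test/events
hdfs dfs -cat /flume_test/events/<yy-MM-dd>/<HH-mm>/pool.*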
4. Monitor a file and write its contents to HBase
(1) Write the agent definition file exec_hbase.conf with the following contents:
# the three components: sources, channels, sinks
agent4.sources = sour4
agent4.channels = chan4
agent4.sinks = sink4
# properties for sour4
agent4.sources.sour4.type = exec
agent4.sources.sour4.command = tail -f /root/addHbase.txt
# properties for chan4
agent4.channels.chan4.type = memory
agent4.channels.chan4.capacity = 20000
agent4.channels.chan4.transactionCapacity = 20000
# properties for sink4
agent4.sinks.sink4.type = hbase
agent4.sinks.sink4.table = flu_stus
agent4.sinks.sink4.columnFamily = baseInfo
agent4.sinks.sink4.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
agent4.sinks.sink4.serializer.regex = \\[(.*?)\\]\\ \\[(.*?)\\]\\ \\[(.*?)\\]\\ \\[(.*?)\\]
agent4.sinks.sink4.serializer.colNames = id,name,age,addr
# wire the three components together
agent4.sources.sour4.channels = chan4
agent4.sinks.sink4.channel = chan4
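The hbase sink writes into an existing table, so the flu_stus table with the baseInfo column family referenced above must be created before the agent is started, for example from the HBase shell:
hbase shell
create 'flu_stus','baseInfo'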
(2) Start Flume
flume-ng agent -n agent4 -c . -f exec_hbase.conf -Dflume.root.logger=INFO,console
(3) Append lines to addHbase.txt
[root@pseudo01 ~]# echo "[s001] [nike] [18] [xian]" >> addHbase.txt
[root@pseudo01 ~]# echo "[s002] [mike] [19] [beijing]" >> addHbase.txt
[root@pseudo01 ~]# echo "[s003] [lucy] [22] [shanghai]" >> addHbase.txt
[root@pseudo01 ~]# echo "[s004] [john] [21] [nanjing]" >> addHbase.txt
(4) Check the results
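A quick check can be done directly from the HBase shell before setting up the Phoenix mapping:
scan 'flu_stus'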
Create a mapping table in Phoenix and query the results
create table "flu_stus"(
"ROW" varchar not null primary key,
"baseInfo"."id" varchar,
"baseInfo"."name" varchar,
"baseInfo"."age" varchar,
"baseInfo"."addr" varchar);
Query the results
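For example, from the Phoenix sqlline client (the quoted lower-case identifier matches the HBase table name created above):
select * from "flu_stus";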