I. Architecture
- Flume is deployed and runs as one or more Agents
An Agent consists of three components:
Source
Channel
Sink
- Multi-hop chaining (topologies)
- Simple chaining: agents connected in series
- Multiplexing the flow: one source fanned out to multiple destinations
- Consolidation: merging the flows of several agents into one destination (see the sketch below)
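A minimal sketch of the consolidation pattern; the agent names, the collector-host address and port 4545 are placeholders, not values from these notes. Each first-tier agent forwards events over an Avro sink to a single collector agent's Avro source:
# first-tier agent (one of several), forwarding events to the collector
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = collector-host
agent1.sinks.avroSink.port = 4545
# collector agent, receiving from all first-tier agents on one Avro source
collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4545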
II. Source
Common source types:
exec source
spooling directory source
http source
avro source
kafka source
netcat source
1. exec source
Executes a Linux command and consumes the command's output, e.g. "tail -f".
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# set the source type to exec
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /opt/soft/flume160/conf/job/tmp.txt
# bind the source to the channel
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
# specify the sink type
a1.sinks.sk1.type = logger
# bind the sink to the channel
a1.sinks.sk1.channel = c1
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
Run Flume:
flume-ng agent -f /opt/soft/flume160/conf/job/exec.conf -n a1 -Dflume.root.logger=INFO,console
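To see events flow, append a line to the tailed file in another terminal (the path is the one configured above):
echo "hello exec source" >> /opt/soft/flume160/conf/job/tmp.txt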
2. spooling directory source
Unlike the Exec Source, the Spooling Directory Source is reliable: no data is lost even if Flume is restarted or killed. The price of this reliability is that files placed in the spooling directory must be immutable and uniquely named. Flume automatically checks for violations and throws an exception when it finds one:
If a file is written to again after it has been fully ingested, Flume prints an error to its log file (the log under Flume's own logs directory) and stops processing.
If a previously used file name is reused later, Flume prints an error to its log file and stops processing.
To avoid both problems, adding a timestamp to the name of each newly generated file works well.
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# set the source type to spooldir
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /data/spooldirTest
# bind the source to the channel
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
# specify the sink type
a1.sinks.sk1.type = logger
# bind the sink to the channel
a1.sinks.sk1.channel = c1
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
Start Flume:
flume-ng agent -f /opt/soft/flume160/conf/job/spool.conf -n a1 -Dflume.root.logger=INFO,console
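Following the file-naming advice above, a simple test is to write a file elsewhere and then move it into the spool directory under a timestamped, unique name (the /tmp path is just an example):
ts=$(date +%Y%m%d%H%M%S)
echo "hello spooldir" > /tmp/events_$ts.txt
mv /tmp/events_$ts.txt /data/spooldirTest/events_$ts.txt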
3. http source
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# set the source type to http
a1.sources.s1.type = http
a1.sources.s1.port = 5140
# bind the source to the channel
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
# specify the sink type
a1.sinks.sk1.type = logger
# bind the sink to the channel
a1.sinks.sk1.channel = c1
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
Start Flume:
flume-ng agent -f /opt/soft/flume160/conf/job/http.conf -n a1 -Dflume.root.logger=INFO,console
In another terminal, send a test event:
curl -XPOST localhost:5140 -d'[{"headers":{"h1":"v1","h2":"v2"},"body":"hello body"}]'
4. avro source
The Avro Source listens on an Avro port and receives event streams sent from external Avro clients. Paired with an Avro Sink on the previous-tier Agent, it forms a tiered topology.
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# set the source type to avro; type, bind and port are all required
a1.sources.s1.type = avro
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
# bind the source to the channel
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
# specify the sink type
a1.sinks.sk1.type = logger
# bind the sink to the channel
a1.sinks.sk1.channel = c1
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
Then define a second agent whose Avro sink writes data to it:
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# set the source type to exec
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /opt/soft/flume160/conf/job/tmp.txt
# bind the source to the channel
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
# specify the sink type; the port must match the Avro source above
a1.sinks.sk1.type = avro
a1.sinks.sk1.hostname = localhost
a1.sinks.sk1.port = 44444
# bind the sink to the channel
a1.sinks.sk1.channel = c1
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
Start the agent with the Avro source first:
flume-ng agent -f /opt/soft/flume160/conf/job/avro_source.conf -n a1 -Dflume.root.logger=INFO,console
Then start the agent with the Avro sink:
flume-ng agent -f /opt/soft/flume160/conf/job/avro_sink.conf -n a1 -Dflume.root.logger=INFO,console
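With both agents running, a line appended to the tailed file on the sink side should appear in the Avro-source agent's logger output; flume-ng also ships an avro-client that can send a file straight to the listening port (/etc/hosts below is just an arbitrary test file):
echo "hello avro" >> /opt/soft/flume160/conf/job/tmp.txt
flume-ng avro-client -H localhost -p 44444 -F /etc/hosts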
5. netcat source
agent.sources = s1
agent.channels = c1
agent.sinks = sk1
agent.sources.s1.type = netcat
agent.sources.s1.bind = localhost
agent.sources.s1.port = 5678
agent.sources.s1.channels = c1
agent.sinks.sk1.type = logger
agent.sinks.sk1.channel = c1
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000
agent.channels.c1.transactionCapacity = 100
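A start command and quick test for this agent; note that the agent here is named agent rather than a1, and the netcat.conf path below is only an assumed location:
flume-ng agent -f /opt/soft/flume160/conf/job/netcat.conf -n agent -Dflume.root.logger=INFO,console
# in another terminal
telnet localhost 5678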
6. taildir source
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# set the source type to TAILDIR
a1.sources.s1.type = TAILDIR
a1.sources.s1.filegroups = f1 f2
# configure filegroup f1 (a single file) and f2 (a file name pattern)
a1.sources.s1.filegroups.f1 = /data/tail_1/example.log
a1.sources.s1.filegroups.f2 = /data/tail_2/.*log.*
# where to store the position (read offset) file
a1.sources.s1.positionFile = /data/tail_position/taildir_position.json
# headers added per filegroup
a1.sources.s1.headers.f1.headerKey1 = value1
a1.sources.s1.headers.f2.headerKey1 = value2
a1.sources.s1.headers.f2.headerKey2 = value3
a1.sources.s1.fileHeader = true
# bind the source to the channel
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
# specify the sink type
a1.sinks.sk1.type = logger
# bind the sink to the channel
a1.sinks.sk1.channel = c1
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
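Start command and a quick test; the taildir.conf path is only an assumed location. Appending to either file group produces events, and the position file records read offsets so a restart resumes where it left off:
flume-ng agent -f /opt/soft/flume160/conf/job/taildir.conf -n a1 -Dflume.root.logger=INFO,console
# in another terminal
echo "hello taildir" >> /data/tail_1/example.log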
III. Channel
Memory Channel
Events are kept in the Java heap. Recommended when a small amount of data loss is acceptable.
File Channel
Events are stored in local files. Highly reliable, but throughput is lower than the Memory Channel.
Kafka Channel
Events are stored in a Kafka topic (see the sketch after this list).
JDBC Channel
Events are stored in a relational database. Generally not recommended.
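A minimal Kafka Channel sketch; the property names follow the Flume 1.6-era documentation, and the broker/ZooKeeper addresses and topic name are placeholders rather than values from these notes:
# events are buffered in a Kafka topic instead of the heap or local files
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.brokerList = localhost:9092
a1.channels.c1.zookeeperConnect = localhost:2181
a1.channels.c1.topic = flume-channel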
IV. Sink
Commonly used sinks:
- avro sink
- HDFS sink
- Hive sink
- Kafka sink
hdfs sink
Read files from the local file system and write them to HDFS.
user_friends.sources = userFriendsSource
user_friends.channels = userFriendsChannel
user_friends.sinks = userFriendsSink
# spooldir
user_friends.sources.userFriendsSource.type = spooldir
user_friends.sources.userFriendsSource.spoolDir = /data/flumeFile/user_friends
user_friends.sources.userFriendsSource.deserializer = LINE
user_friends.sources.userFriendsSource.deserializer.maxLineLength = 600000
user_friends.sources.userFriendsSource.includePattern = userFriends_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
user_friends.channels.userFriendsChannel.type = file
user_friends.channels.userFriendsChannel.checkpointDir = /data/flumeFile/checkpoint/userFriends
user_friends.channels.userFriendsChannel.dataDirs = /data/flumeFile/data/userFriends
user_friends.sinks.userFriendsSink.type = hdfs
# fileType: output file format. Supported: SequenceFile, DataStream, CompressedStream.
#   DataStream does not compress the file and needs no hdfs.codeC; CompressedStream requires hdfs.codeC to be set. Default: SequenceFile
user_friends.sinks.userFriendsSink.hdfs.fileType = DataStream
user_friends.sinks.userFriendsSink.hdfs.filePrefix = userFriend
user_friends.sinks.userFriendsSink.hdfs.fileSuffix = .csv
user_friends.sinks.userFriendsSink.hdfs.path = hdfs://192.168.108.181:9000/user/userFriend/%Y-%m-%d
# useLocalTimeStamp: when date/time escape sequences are used in hdfs.path, use the local timestamp instead of the one in the Event header. Default: false
user_friends.sinks.userFriendsSink.hdfs.useLocalTimeStamp = true
# batchSize: number of Events written to HDFS per batch. Default: 100
user_friends.sinks.userFriendsSink.hdfs.batchSize = 640
# rollCount: roll to a new file after this many Events have been written (0 = never roll by Event count). Default: 10
user_friends.sinks.userFriendsSink.hdfs.rollCount = 0
# rollSize: roll to a new file once the current file reaches this size in bytes (0 = never roll by file size). Default: 1024
user_friends.sinks.userFriendsSink.hdfs.rollSize = 64000000
# rollInterval: roll to a new file after this many seconds (0 = never roll by time). Default: 30
user_friends.sinks.userFriendsSink.hdfs.rollInterval = 30
user_friends.sources.userFriendsSource.channels = userFriendsChannel
user_friends.sinks.userFriendsSink.channel = userFriendsChannel
Start Flume:
flume-ng agent --name user_friends --conf conf/ --conf-file /opt/soft/flume160/conf/user_friends.conf -Dflume.root.logger=INFO,console
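After files have rolled, the results can be checked on HDFS (the path comes from hdfs.path above; %Y-%m-%d expands to the ingest date):
hdfs dfs -ls -R /user/userFriend/
hdfs dfs -cat /user/userFriend/*/userFriend.*.csv | head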
Write an interceptor in IDEA, package it as a jar and drop it into Flume's lib directory, then use it to route data into HDFS.
1. Add the dependency
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.6.0</version>
</dependency>
2. Write the interceptor
package com.wang.test01;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class InterceptorDemo implements Interceptor {
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        addHeaderEvents = new ArrayList<>();
    }

    // Business logic: tag each event so the channel selector can route it
    @Override
    public Event intercept(Event event) {
        // event headers
        Map<String, String> headers = event.getHeaders();
        // event body as text
        String body = new String(event.getBody());
        // bodies starting with "gree" get type=gree in the headers,
        // everything else gets type=lijia
        if (body.startsWith("gree")) {
            headers.put("type", "gree");
        } else {
            headers.put("type", "lijia");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        addHeaderEvents.clear();
        for (Event event : list) {
            addHeaderEvents.add(intercept(event));
        }
        return addHeaderEvents;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new InterceptorDemo();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
3. Flume configuration file
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 55555
a1.sources.r1.interceptors = i1
# fully qualified class name + $ + the inner Builder class
a1.sources.r1.interceptors.i1.type = com.wang.test01.InterceptorDemo$Builder
a1.sources.r1.selector.type = multiplexing
# header key the multiplexing selector routes on
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.gree = c1
a1.sources.r1.selector.mapping.lijia = c2
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.filePrefix = gree
a1.sinks.k1.hdfs.fileSuffix = .csv
a1.sinks.k1.hdfs.path = hdfs://192.168.108.181:9000/user/greedemo/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.batchSize = 640
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 100
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.filePrefix = lijia
a1.sinks.k2.hdfs.fileSuffix = .csv
a1.sinks.k2.hdfs.path = hdfs://192.168.108.181:9000/user/lijiademo/%Y-%m-%d
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.batchSize = 640
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollSize = 100
a1.sinks.k2.hdfs.rollInterval = 3
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
4. Start Flume
flume-ng agent --name a1 --conf conf/ --conf-file /opt/soft/flume160/conf/job/netcat-flume-loggerhdfs.conf -Dflume.root.logger=INFO,console
5. Start telnet (telnet must be installed)
telnet localhost 55555
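Lines typed into the telnet session are tagged by the interceptor and routed by the multiplexing selector: anything starting with gree gets the header type=gree and flows through c1 to the greedemo path, everything else gets type=lijia and flows through c2 to the lijiademo path. For example (<date> stands for the current date directory):
gree hello      -> hdfs://192.168.108.181:9000/user/greedemo/<date>/gree...csv
lijia hello     -> hdfs://192.168.108.181:9000/user/lijiademo/<date>/lijia...csv
other message   -> hdfs://192.168.108.181:9000/user/lijiademo/<date>/lijia...csv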