Flume Case Studies (Part 3)

This article walks through how to configure several Apache Flume interceptors: the timestamp, host, UUID, search-and-replace, regex filtering, and regex extractor interceptors. Each one is demonstrated with a working configuration for preprocessing data, and the article closes with the steps and a sample implementation for a custom interceptor.


Case 6: Flume Interceptors

1: Timestamp Interceptor

vi /opt/mod/flume/jobconf/flume-timestamp.conf

#1. Define the agent name and the names of the source, channel, and sink
a4.sources = r1
a4.channels = c1
a4.sinks = k1

#2. Configure the source
a4.sources.r1.type = spooldir
a4.sources.r1.spoolDir = /opt/upload

# Define the interceptor, which adds a timestamp to each event's headers
a4.sources.r1.interceptors = timestamp
a4.sources.r1.interceptors.timestamp.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

# Configure the channel
a4.channels.c1.type = memory
a4.channels.c1.capacity = 10000
a4.channels.c1.transactionCapacity = 100

# Configure the sink
a4.sinks.k1.type = hdfs
a4.sinks.k1.hdfs.path = hdfs://192.168.1.31:9000/flume-interceptors/%H
a4.sinks.k1.hdfs.filePrefix = events-
a4.sinks.k1.hdfs.fileType = DataStream

# Do not roll files based on event count
a4.sinks.k1.hdfs.rollCount = 0
# Roll a new file on HDFS when the current file reaches 128 MB
a4.sinks.k1.hdfs.rollSize = 134217728
# Roll a new file on HDFS every 60 seconds
a4.sinks.k1.hdfs.rollInterval = 60

# Wire the source, channel, and sink together
a4.sources.r1.channels = c1
a4.sinks.k1.channel = c1

Run the start command and observe the result:

/opt/mod/flume/bin/flume-ng agent -n a4 -f /opt/mod/flume/jobconf/flume-timestamp.conf -c /opt/mod/flume/conf -Dflume.root.logger=INFO,console
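The timestamp interceptor does not modify the event body; it adds a timestamp header (epoch milliseconds) to each event, and the HDFS sink's %H escape uses that header to bucket files by hour. The minimal Java sketch below just prints that header directly; the class name is made up for illustration, and it assumes the flume-ng-core dependency shown in Case 7 below:

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.interceptor.Interceptor;
import org.apache.flume.interceptor.TimestampInterceptor;

public class TimestampInterceptorDemo {
    public static void main(String[] args) {
        // Build the interceptor the same way the agent builds it from the config
        Interceptor.Builder builder = new TimestampInterceptor.Builder();
        builder.configure(new Context());
        Interceptor interceptor = builder.build();

        Event event = EventBuilder.withBody("hello flume".getBytes());
        event = interceptor.intercept(event);

        // Prints the epoch-millisecond value that %H in the HDFS path is derived from
        System.out.println(event.getHeaders().get("timestamp"));
    }
}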

2: Host Interceptor

vi /opt/mod/flume/jobconf/flume-host.conf


#1. Define the agent
a1.sources= r1
a1.sinks = k1
a1.channels = c1

#2. Configure the source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/upload
# Interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host

# When true, the IP address (e.g. 192.168.1.111) is used; when false, the hostname is used. Defaults to true.
a1.sources.r1.interceptors.i1.useIP = false
a1.sources.r1.interceptors.i1.hostHeader = agentHost

#3. Configure the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://192.168.1.31:9000/flumehost/%{agentHost}
a1.sinks.k1.hdfs.filePrefix = zhang_%{agentHost}
# Append the .log suffix to generated files
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
 
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
 
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command:
/opt/mod/flume/bin/flume-ng agent -c /opt/mod/flume/conf/ -f /opt/mod/flume/jobconf/flume-host.conf -n a1 -Dflume.root.logger=INFO,console
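The host interceptor resolves the agent machine's hostname (or its IP address when useIP is true) and stores it in the header named by hostHeader, agentHost here, which the sink then expands via %{agentHost} in the path and file prefix. A minimal sketch of that behavior with the same settings as i1 above (the demo class name is made up for illustration):

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.interceptor.HostInterceptor;
import org.apache.flume.interceptor.Interceptor;

public class HostInterceptorDemo {
    public static void main(String[] args) {
        // Mirror the i1 settings from flume-host.conf
        Context ctx = new Context();
        ctx.put("useIP", "false");
        ctx.put("hostHeader", "agentHost");

        Interceptor.Builder builder = new HostInterceptor.Builder();
        builder.configure(ctx);
        Interceptor interceptor = builder.build();

        Event event = interceptor.intercept(EventBuilder.withBody("test".getBytes()));
        // Prints the local hostname, which %{agentHost} expands to in the sink path
        System.out.println(event.getHeaders().get("agentHost"));
    }
}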

3: UUID Interceptor

vi /opt/mod/flume/jobconf/flume-uuid.conf

#1.定义agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/zhang
a1.sources.r1.interceptors = i1
# The type cannot be abbreviated to uuid; the fully qualified class name is required, otherwise the class will not be found
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
# If a UUID header already exists, it should be preserved
a1.sources.r1.interceptors.i1.preserveExisting = true
a1.sources.r1.interceptors.i1.prefix = UUID_

# If the sink type were changed to HDFS, the header data would not appear in the text written to HDFS (headers are not part of the event body)
a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the start command and observe the result:

/opt/mod/flume/bin/flume-ng agent -c /opt/mod/flume/conf/ -f /opt/mod/flume/jobconf/flume-uuid.conf -n a1 -Dflume.root.logger=INFO,console
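The UUID interceptor assigns each event a unique identifier header (by default the header is named id, as far as I recall), prefixed with UUID_ here; with preserveExisting = true, an id that is already present is kept. The actual class lives in the flume-ng-morphline-solr-sink module, so the rough sketch below only mimics the behavior with plain Java:

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class UuidHeaderSketch {
    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<String, String>();
        String headerName = "id";        // assumed default header name
        boolean preserveExisting = true; // matches i1.preserveExisting above
        String prefix = "UUID_";         // matches i1.prefix above

        // Only assign a new id if none exists (or if preserveExisting were false)
        if (!preserveExisting || !headers.containsKey(headerName)) {
            headers.put(headerName, prefix + UUID.randomUUID().toString());
        }
        System.out.println(headers);
    }
}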

4: Search-and-Replace Interceptor

vi /opt/mod/flume/jobconf/flume-search.conf

#1 agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#2 source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/zhang 
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = search_replace

# Replace any run of digits with ***; for example, A123 becomes A***
a1.sources.r1.interceptors.i1.searchPattern = [0-9]+
a1.sources.r1.interceptors.i1.replaceString = ***
a1.sources.r1.interceptors.i1.charset = UTF-8

#3 sink
a1.sinks.k1.type = logger

#4 Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#5 bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the start command and observe the result:

/opt/mod/flume/bin/flume-ng agent -c /opt/mod/flume/conf/ -f /opt/mod/flume/jobconf/flume-search.conf -n a1 -Dflume.root.logger=INFO,console
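The search_replace interceptor applies searchPattern and replaceString to every event body, which is equivalent to a Java Matcher.replaceAll call. An illustrative sketch of the substitution configured above (class name made up for illustration):

import java.util.regex.Pattern;

public class SearchReplaceSketch {
    public static void main(String[] args) {
        // Same substitution the interceptor performs on each event body
        Pattern searchPattern = Pattern.compile("[0-9]+");
        String body = "A123 order placed at 10:05";
        String replaced = searchPattern.matcher(body).replaceAll("***");
        System.out.println(replaced);   // A*** order placed at ***:***
    }
}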
 

5: Regex Filtering Interceptor

vi /opt/mod/flume/jobconf/flume-filter.conf

#1 agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#2 source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/zhang
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^A.*
# With excludeEvents set to false, events that do NOT start with A are dropped; with excludeEvents set to true, events that DO start with A are dropped.
a1.sources.r1.interceptors.i1.excludeEvents = true

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the start command and observe the result; you can see that lines starting with A (such as AA...) have been filtered out:
/opt/mod/flume/bin/flume-ng agent -c /opt/mod/flume/conf/ -f /opt/mod/flume/jobconf/flume-filter.conf -n a1 -Dflume.root.logger=INFO,console
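regex_filter makes a keep-or-drop decision per event: the body is matched against the regex, and excludeEvents decides whether matching events are kept or dropped. An illustrative sketch of that decision with the settings above (regex = ^A.*, excludeEvents = true; class name made up for illustration):

import java.util.regex.Pattern;

public class RegexFilterSketch {
    public static void main(String[] args) {
        Pattern regex = Pattern.compile("^A.*");
        boolean excludeEvents = true;   // matching events are dropped

        for (String body : new String[]{"AA hello", "BB hello"}) {
            boolean matches = regex.matcher(body).find();
            boolean keep = excludeEvents ? !matches : matches;
            System.out.println(body + " -> " + (keep ? "kept" : "dropped"));
        }
    }
}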

6: Regex Extractor Interceptor

vi /opt/mod/flume/jobconf/flume-extractor.conf

#1 agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#2 source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/zhang
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
# hostname is bigdata31 ip is 192.168.1.31
a1.sources.r1.interceptors.i1.regex = hostname is (.*?) ip is (.*)
a1.sources.r1.interceptors.i1.serializers = s1 s2
# hostname (custom header name) = first group (.*?) -> bigdata31
a1.sources.r1.interceptors.i1.serializers.s1.name = hostname
# ip (custom header name) = second group (.*) -> 192.168.1.31
a1.sources.r1.interceptors.i1.serializers.s2.name = ip

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the start command and observe the result:

/opt/mod/flume/bin/flume-ng agent -c /opt/mod/flume/conf/ -f /opt/mod/flume/jobconf/flume-extractor.conf -n a1 -Dflume.root.logger=INFO,console
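regex_extractor matches the regex against the event body and maps each capture group, in order, onto the serializer names, which become event headers. An illustrative sketch of the mapping configured above (class name made up for illustration):

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractorSketch {
    public static void main(String[] args) {
        Pattern regex = Pattern.compile("hostname is (.*?) ip is (.*)");
        String body = "hostname is bigdata31 ip is 192.168.1.31";

        Matcher m = regex.matcher(body);
        Map<String, String> headers = new HashMap<String, String>();
        if (m.find()) {
            headers.put("hostname", m.group(1));  // s1.name -> bigdata31
            headers.put("ip", m.group(2));        // s2.name -> 192.168.1.31
        }
        System.out.println(headers);
    }
}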

Case 7: Custom Flume Interceptor

1: Configure pom.xml
        <!-- Flume core dependency -->
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.8.0</version>
        </dependency>

2: Implement the custom interceptor

package flume;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.util.ArrayList;
import java.util.List;

public class MyInterceptor implements Interceptor {

    @Override
    public void initialize() {}

    /**
     * Intercept a single event coming from the source.
     * @param event the event to process
     * @return the event after business processing
     */
    @Override
    public Event intercept(Event event) {
        // Get the event body
        byte[] body = event.getBody();
        // Convert the body to upper case
        event.setBody(new String(body).toUpperCase().getBytes());
        return event;
    }

    // Process a batch of events by applying intercept() to each one
    @Override
    public List<Event> intercept(List<Event> list) {
        List<Event> result = new ArrayList<Event>();
        for (Event e : list) {
            result.add(intercept(e));
        }
        return result;
    }

    @Override
    public void close() {}

    public static class Build implements Interceptor.Builder {
        // Return an instance of the custom interceptor
        @Override
        public Interceptor build() {
            return new MyInterceptor();
        }

        @Override
        public void configure(Context context) {}
    }
}
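Before packaging, the interceptor logic can be sanity-checked locally with a tiny main method; this is a minimal sketch, the test class name is made up for illustration, and it relies only on the flume-ng-core dependency already declared:

package flume;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class MyInterceptorTest {
    public static void main(String[] args) {
        MyInterceptor interceptor = new MyInterceptor();
        Event event = EventBuilder.withBody("hello flume".getBytes());
        event = interceptor.intercept(event);
        System.out.println(new String(event.getBody()));   // HELLO FLUME
    }
}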

Package the project into a jar with Maven, create a jar directory under the Flume home (mkdir jar), and upload the jar into that directory.

3: Flume configuration file

vi /opt/mod/flume/jobconf/ToUpCase.conf

#1.agent
a1.sources = r1
a1.sinks =k1
a1.channels = c1


# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/zhang
a1.sources.r1.interceptors = i1
# Fully qualified class name of the interceptor's Builder inner class (here named Build)
a1.sources.r1.interceptors.i1.type = flume.MyInterceptor$Build

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.1.31:9000/ToUpCase1/%H
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File type for generated files; the default is SequenceFile, and DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the start command and observe the result:

/opt/mod/flume/bin/flume-ng agent -c /opt/mod/flume/conf/ -n a1 -f /opt/mod/flume/jobconf/ToUpCase.conf -C /opt/mod/flume/jar/zhang-1.0-SNAPSHOT.jar -Dflume.root.logger=DEBUG,console

 

Exercise:

Requirement: Flume should collect data from hundreds of web servers, forward the log data through a dozen or so agent servers, and finally write it to HDFS.

1. source1: collect data arriving on port 4666, on node bigdata31.
2. source2: monitor the Hadoop log, on node bigdata31.
3. channel1: receive the data from source1 and source2 and sink it to HDFS, on node bigdata31.
Within a single agent: fan-out is source->channel 1:n, fan-in is source->channel n:1.

vi /opt/mod/flume/jobconf/flume-22.conf
        
# 1. agent: source->channel mapping is 1:n, sink->channel mapping is 1:1
a1.sources = r1 r2
a1.sinks = k1
a1.channels = c1 

# 2.source1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/mod/hadoop-2.8.4/logs/hadoop-root-namenode-bigdata31.log
a1.sources.r1.shell = /bin/bash -c
# Set a static interceptor to tag the events
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = key
a1.sources.r1.interceptors.i1.value = hdfs_log

# 3.source2
a1.sources.r2.type = netcat
a1.sources.r2.bind = bigdata31
a1.sources.r2.port = 4666
# Set a static interceptor to tag the events
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = key
a1.sources.r2.interceptors.i2.value = netcat_log

# 4.sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.1.31:9000/flume4/%H
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = log-
# Whether to roll folders based on time
a1.sinks.k1.hdfs.round = true
# How many time units before creating a new folder
a1.sinks.k1.hdfs.roundValue = 1
# The time unit for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 600
# Roll size per file, roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# File rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0
# Minimum block replicas
a1.sinks.k1.hdfs.minBlockReplicas = 1

# 5 channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# 6 Bind
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sinks.k1.channel = c1

Run the start command and check the result:

/opt/mod/flume/bin/flume-ng agent -c /opt/mod/flume/conf/ -f /opt/mod/flume/jobconf/flume-22.conf -n a1 -Dflume.root.logger=INFO,console
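The static interceptors in this exercise only stamp every event from each source with a fixed header (key with value hdfs_log or netcat_log), so the two streams can still be told apart after they are merged into the shared channel. A minimal sketch of that tagging with the same settings as i1 above (the demo class name is made up for illustration):

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.interceptor.Interceptor;
import org.apache.flume.interceptor.StaticInterceptor;

public class StaticInterceptorDemo {
    public static void main(String[] args) {
        // Mirror the i1 settings from flume-22.conf
        Context ctx = new Context();
        ctx.put("key", "key");
        ctx.put("value", "hdfs_log");

        Interceptor.Builder builder = new StaticInterceptor.Builder();
        builder.configure(ctx);
        Interceptor interceptor = builder.build();

        Event event = interceptor.intercept(EventBuilder.withBody("a log line".getBytes()));
        // Every event now carries the fixed header key=hdfs_log
        System.out.println(event.getHeaders());
    }
}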
