Flume Source Code Analysis: The Data Flow Framework (Part 5)

License: CC 4.0 BY-SA

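This article analyzes the data flow in Flume-NG, covering how the source, channel and sink components are started and how they interact. Working through the source code, it shows how Flume uses LifecycleSupervisor to start components, how a source such as AvroSource receives log data, how the data passes through the transactional handling of a channel such as MemoryChannel, and how a sink such as KafkaSink finally processes and forwards the events.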


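In Flume-NG, three components (source, channel and sink) collect, transport and deliver the target data. This article walks through the source code to see how Flume wires these components together to move data from end to end, as a reference for building similar frameworks. The source version is still apache-flume-1.6.0-src.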

I. Component Configuration and Startup

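1. When Flume is started from a configuration file, the main function in Application.java (package org.apache.flume.node) reads the configuration, such as the source names and channel names. The main function contains: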

PropertiesFileConfigurationProvider configurationProvider =
    new PropertiesFileConfigurationProvider(agentName, configurationFile);
application = new Application();
application.handleConfigurationEvent(configurationProvider.getConfiguration());

handleConfigurationEvent then calls startAllComponents, as follows:

  public synchronized void handleConfigurationEvent(MaterializedConfiguration conf) {
    stopAllComponents();
    startAllComponents(conf);
  }

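The function to focus on is startAllComponents, which starts the source, channel and sink components as the configuration file specifies. The start order is channels first, then sinks, then sources. Because Flume supports multiple sources, channels and sinks, the code iterates over each map of components with Java's Map.Entry and starts them one by one.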
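The start order described above can be sketched in a few lines of plain Java. This is only an illustration: `Startable`, `StartOrderSketch` and the component names `channel1` and `sink1` are made up for the example and are not Flume classes or names from the original configuration. One plausible reason for starting channels first is that sinks and sources then always find a running channel to read from or write to.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a startable Flume component. */
interface Startable {
    void start();
}

class StartOrderSketch {
    /** Starts every entry of one component map and records the order used. */
    private static void startGroup(String kind, Map<String, Startable> group,
                                   List<String> order) {
        for (Map.Entry<String, Startable> entry : group.entrySet()) {
            entry.getValue().start();
            order.add(kind + ":" + entry.getKey());
        }
    }

    /** Starts channels, then sinks, then sources. */
    static List<String> startAll(Map<String, Startable> channels,
                                 Map<String, Startable> sinks,
                                 Map<String, Startable> sources) {
        List<String> order = new ArrayList<>();
        startGroup("channel", channels, order);
        startGroup("sink", sinks, order);
        startGroup("source", sources, order);
        return order;
    }

    public static void main(String[] args) {
        Map<String, Startable> channels = new LinkedHashMap<>();
        channels.put("channel1", () -> System.out.println("channel1 started"));
        Map<String, Startable> sinks = new LinkedHashMap<>();
        sinks.put("sink1", () -> System.out.println("sink1 started"));
        Map<String, Startable> sources = new LinkedHashMap<>();
        sources.put("source1", () -> System.out.println("source1 started"));
        System.out.println(startAll(channels, sinks, sources));
    }
}
```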
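Startup relies on LifecycleSupervisor.supervise(), which schedules a MonitorRunnable on a thread pool with scheduleWithFixedDelay. Each time it runs, the MonitorRunnable compares the component's current state with the desired state and acts only when the two differ:
if(!lifecycleAware.getLifecycleState().equals(supervisoree.status.desiredState))
Tracing the code shows that every component's startup ultimately comes down to a call to LifecycleAware.start(). Looking back at Sink.java, Channel.java and Source.java, all three component interfaces extend LifecycleAware, so any concrete implementation built on them only needs to implement its own start() method to be started; stop() and the other lifecycle methods work the same way. It is a clean design that is well worth studying.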
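To make the supervision idea concrete, here is a minimal sketch of a monitor task that is scheduled with scheduleWithFixedDelay and calls start() only while the component is not yet in its desired state. It is a simplified model, not Flume's LifecycleSupervisor: `LifeState`, `Supervised`, `DemoComponent`, `SupervisorSketch` and the 100 ms delay are all assumptions made for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

enum LifeState { IDLE, START, STOP }

/** Simplified model of a lifecycle-aware component (illustrative names). */
interface Supervised {
    void start();
    void stop();
    LifeState getState();
}

class DemoComponent implements Supervised {
    private volatile LifeState state = LifeState.IDLE;
    private final CountDownLatch started = new CountDownLatch(1);

    public void start() { state = LifeState.START; started.countDown(); }
    public void stop() { state = LifeState.STOP; }
    public LifeState getState() { return state; }

    /** Waits up to the given time for start() to be called. */
    boolean awaitStarted(long millis) {
        try {
            return started.await(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

class SupervisorSketch {
    private final ScheduledExecutorService monitor =
        Executors.newScheduledThreadPool(1);

    /** Periodically drives the component toward the desired state. */
    void supervise(Supervised component, LifeState desired) {
        Runnable check = () -> {
            // Act only when the current state differs from the desired one.
            if (!component.getState().equals(desired)) {
                if (desired == LifeState.START) {
                    component.start();
                } else if (desired == LifeState.STOP) {
                    component.stop();
                }
            }
        };
        monitor.scheduleWithFixedDelay(check, 0, 100, TimeUnit.MILLISECONDS);
    }

    void shutdown() { monitor.shutdownNow(); }

    public static void main(String[] args) {
        SupervisorSketch supervisor = new SupervisorSketch();
        DemoComponent component = new DemoComponent();
        supervisor.supervise(component, LifeState.START);
        component.awaitStarted(2000);
        System.out.println("state = " + component.getState());
        supervisor.shutdown();
    }
}
```

Once the component reports the desired state, later runs of the monitor task find nothing to do, so a component is not started twice.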

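II. How the Source Receives Data and Passes It to the Channel
We again use the example from Part 1 of this series, in which log4j logs are collected into Flume. The source used in that example is: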

agent1.sources.source1.type = avro

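Tracing the source startup shows that start() in AvroSource.java only starts a Netty server, which then waits for Netty clients to send data. The client here is created by Log4jAppender, whose append() function wraps each log record in an Event and sends it to the AvroSource over Netty. The append() function in AvroSource.java then decodes the Event, which completes the log collection step.
The append() function in AvroSource.java contains: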

getChannelProcessor().processEvent(event)

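Once the Event has been received and decoded, it is handed to the ChannelProcessor, which is bound to the source's channels and passes the event on to the channels selected for it.
The core of this hand-off is the processEvent() function in the ChannelProcessor class, which puts the Event into the channel's queue inside a transaction.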

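      Transaction tx = reqChannel.getTransaction();
      Preconditions.checkNotNull(tx, "Transaction object must not be null");
      try {
        tx.begin();

        reqChannel.put(event);

        tx.commit();
      } catch (Throwable t) {
        // Undo the put on any failure (rethrow logic omitted in this excerpt)
        tx.rollback();
      } finally {
        // Always release the transaction
        tx.close();
      }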

The Transaction interface and its implementations are related as shown in the figure. [Figure: Transaction interface hierarchy]

The commonly used base class is BasicTransactionSemantics, from which four common implementations derive, such as those of the file, kafka and memory channels.

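We take the memory channel as an example.
All of Flume processes events through transactions. In MemoryChannel, reqChannel.put(event) places the event in a LinkedBlockingDeque, which is a blocking double-ended queue rather than a stack: the event is first staged in the transaction's own put list, and tx.commit() then moves it into the channel's queue. This completes the flow of data from source to channel.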
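The put-then-commit behaviour of a memory-backed channel can be modelled with two LinkedBlockingDeque instances, one per transaction and one for the channel. The sketch below is a toy model under stated assumptions, not Flume's MemoryChannel: `ToyMemoryChannel`, `Tx`, `ChannelPutSketch`, the String event type and the capacities are all invented for illustration.

```java
import java.util.concurrent.LinkedBlockingDeque;

/** Toy channel: events are staged per transaction and published on commit. */
class ToyMemoryChannel {
    private final LinkedBlockingDeque<String> queue;

    ToyMemoryChannel(int capacity) {
        this.queue = new LinkedBlockingDeque<>(capacity);
    }

    class Tx {
        // Events put in this transaction, invisible to takers until commit.
        private final LinkedBlockingDeque<String> putList =
            new LinkedBlockingDeque<>(10);

        void put(String event) {
            if (!putList.offer(event)) {
                throw new IllegalStateException("transaction capacity exceeded");
            }
        }

        void commit() {
            // Refuse the whole batch if the channel cannot hold all of it.
            if (queue.remainingCapacity() < putList.size()) {
                throw new IllegalStateException("channel is full");
            }
            String event;
            while ((event = putList.pollFirst()) != null) {
                queue.offer(event); // keeps arrival order
            }
        }

        void rollback() {
            putList.clear(); // staged events are simply dropped
        }
    }

    Tx newTransaction() { return new Tx(); }
    String take() { return queue.pollFirst(); }
    int size() { return queue.size(); }
}

class ChannelPutSketch {
    /** Follows the put / commit-or-rollback shape of processEvent(). */
    static boolean deliver(ToyMemoryChannel channel, String event) {
        ToyMemoryChannel.Tx tx = channel.newTransaction();
        try {
            tx.put(event);
            tx.commit();
            return true;
        } catch (RuntimeException ex) {
            tx.rollback();
            return false;
        }
    }

    public static void main(String[] args) {
        ToyMemoryChannel channel = new ToyMemoryChannel(100);
        deliver(channel, "first log line");
        deliver(channel, "second log line");
        System.out.println("events in channel: " + channel.size());
        System.out.println("head of queue: " + channel.take());
    }
}
```

Staging events in a per-transaction list means that a failed transaction never leaves a partial batch visible to the sink.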

III. How an Event Flows from the Channel to the Sink
The sink side is comparatively simple. Its interface is:

public interface Sink extends LifecycleAware, NamedComponent {
  /**
   * <p>Sets the channel the sink will consume from</p>
   * @param channel The channel to be polled
   */
  public void setChannel(Channel channel);

  /**
   * @return the channel associated with this sink
   */
  public Channel getChannel();

  /**
   * <p>Requests the sink to attempt to consume data from attached channel</p>
   * <p><strong>Note</strong>: This method should be consuming from the channel
   * within the bounds of a Transaction. On successful delivery, the transaction
   * should be committed, and on failure it should be rolled back.
   * @return READY if 1 or more Events were successfully delivered, BACKOFF if
   * no data could be retrieved from the channel feeding this sink
   * @throws EventDeliveryException In case of any kind of failure to
   * deliver data to the next hop destination.
   */
  public Status process() throws EventDeliveryException;

  public static enum Status {
    READY, BACKOFF
  }
}

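The interface mainly defines process() and getChannel(). getChannel() returns the channel the sink is attached to, and process() takes the events stored in that channel's queue and handles them. See KafkaSink.java and the other sink implementations for examples.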
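As a rough illustration of the process() contract, the sketch below takes one event per call from a queue that stands in for the channel, returns READY when an event was delivered and BACKOFF when the queue was empty, and puts the event back at the head of the queue if delivery fails. It does not use Flume's Sink, Channel or Transaction types: `SinkSketch`, its `deliver` method and the String events are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingDeque;

class SinkSketch {
    enum Status { READY, BACKOFF }

    // Stand-in for the channel this sink consumes from.
    private final LinkedBlockingDeque<String> channel;
    final List<String> delivered = new ArrayList<>();

    SinkSketch(LinkedBlockingDeque<String> channel) {
        this.channel = channel;
    }

    /** Consumes at most one event per call. */
    Status process() {
        String event = channel.pollFirst(); // "take" from the channel
        if (event == null) {
            return Status.BACKOFF; // nothing to do, caller should wait
        }
        try {
            deliver(event);
            return Status.READY; // corresponds to committing the take
        } catch (RuntimeException ex) {
            // Corresponds to a rollback: the event returns to the head
            // of the queue so that it is retried first.
            channel.offerFirst(event);
            throw ex;
        }
    }

    /** Hypothetical next hop; a real sink would write to Kafka, HDFS, etc. */
    void deliver(String event) {
        delivered.add(event);
    }

    public static void main(String[] args) {
        LinkedBlockingDeque<String> channel = new LinkedBlockingDeque<>();
        channel.offer("event-1");
        channel.offer("event-2");
        SinkSketch sink = new SinkSketch(channel);
        while (sink.process() == Status.READY) {
            // keep draining until the channel reports BACKOFF
        }
        System.out.println(sink.delivered);
    }
}
```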

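IV. Summary
At this level, Flume's mechanism is quite simple: a source collects data, wraps it in events and puts them into the queue of a channel; a sink then takes the events from that channel and processes them, for example by storing them in Hive or Hadoop or by writing them to a logger. The design approach and the flexible use of Java data structures along the way are well worth borrowing.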