Overview
The previous posts analyzed Flume's Source and MemoryChannel components; this one looks at the third major component, the Sink. The Sink component pulls data from a Channel and delivers it to the next Flume agent or to a destination store such as HDFS.
To analyze the Sink, let's first look at the definition of the Sink interface:
public interface Sink extends LifecycleAware, NamedComponent {

  /**
   * Sets the channel the sink will consume from.
   */
  public void setChannel(Channel channel);

  /**
   * Returns the channel associated with this sink.
   */
  public Channel getChannel();

  /**
   * Requests the sink attempt to consume data from the attached channel.
   * Consumption should happen within the scope of a transaction: on successful
   * delivery the transaction should be committed, on failure it should be
   * rolled back.
   * Returns READY if one or more events were successfully delivered,
   * BACKOFF if no data could be retrieved from the channel for delivery.
   * Throws EventDeliveryException on any kind of failure delivering data to
   * the next-hop destination.
   */
  public Status process() throws EventDeliveryException;

  public static enum Status {
    READY, BACKOFF
  }
}
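To make the contract concrete, here is a minimal sketch of a hypothetical custom sink (the class name LoggingSink and its "deliver by logging" behavior are illustrative only, not part of Flume) that follows the usual take-within-a-transaction pattern the interface describes:

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.sink.AbstractSink;

// Hypothetical sink that just prints event bodies; it only illustrates the
// one-transaction-per-process() contract described above.
public class LoggingSink extends AbstractSink {

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      Event event = channel.take();
      if (event == null) {
        // nothing to deliver: commit the empty transaction and ask to back off
        tx.commit();
        return Status.BACKOFF;
      }
      // "deliver" the event; a real sink would write to its destination here
      System.out.println(new String(event.getBody()));
      tx.commit();
      return Status.READY;
    } catch (Throwable t) {
      tx.rollback();
      throw new EventDeliveryException("Failed to deliver event", t);
    } finally {
      tx.close();
    }
  }
}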
Starting from the Application entry class
As the earlier analysis showed, the entry point of a Flume process is the Application class, which starts the Channel, Sink, and Source components in turn. Following the startup code, a Sink component is started through the start method of its SinkRunner: the MonitorRunnable thread calls lifecycleAware.start(), which for a sink resolves to SinkRunner.start(). Looking the method up in Eclipse, the sink-related implementation is as follows:
@Override
public void start() {
  SinkProcessor policy = getPolicy();   // obtain the sink processor
  policy.start();

  runner = new PollingRunner();
  runner.policy = policy;
  runner.counterGroup = counterGroup;
  runner.shouldStop = new AtomicBoolean();   // atomically updated Boolean, defaults to false

  runnerThread = new Thread(runner);
  runnerThread.setName("SinkRunner-PollingRunner-" +
      policy.getClass().getSimpleName());
  runnerThread.start();   // start the polling thread

  lifecycleState = LifecycleState.START;
}
In this start method, a SinkProcessor is first obtained and started. A PollingRunner is then created, its fields (policy, counterGroup, shouldStop) are set, and it is handed to a new thread; starting that thread invokes the runner's run method:
@Override
public void run() {
  logger.debug("Polling sink runner starting");

  while (!shouldStop.get()) {
    try {
      if (policy.process().equals(Sink.Status.BACKOFF)) {
        counterGroup.incrementAndGet("runner.backoffs");
        Thread.sleep(Math.min(
            counterGroup.incrementAndGet("runner.backoffs.consecutive")
            * backoffSleepIncrement, maxBackoffSleep));
      } else {
        counterGroup.set("runner.backoffs.consecutive", 0L);
      }
    } catch (InterruptedException e) {
      ......
    }
  }

  logger.debug("Polling runner exiting. Metrics:{}", counterGroup);
}
The run method loops (until shouldStop is set to true), calling the SinkProcessor's process method for the next stage of processing. Whenever process returns BACKOFF, the runner sleeps before trying again, and the sleep time grows with each consecutive backoff up to a cap.
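The backoff arithmetic is just the Math.min expression in the loop above. A tiny standalone sketch of how the sleep time grows, assuming an increment of 1000 ms and a cap of 5000 ms for backoffSleepIncrement and maxBackoffSleep (these values may differ in your Flume version or configuration):

public class BackoffSleepDemo {
  public static void main(String[] args) {
    long backoffSleepIncrement = 1000L; // assumed value, in milliseconds
    long maxBackoffSleep = 5000L;       // assumed cap, in milliseconds
    for (long consecutive = 1; consecutive <= 8; consecutive++) {
      long sleepMs = Math.min(consecutive * backoffSleepIncrement, maxBackoffSleep);
      // prints 1000, 2000, 3000, 4000, 5000, 5000, 5000, 5000
      System.out.println("consecutive backoffs=" + consecutive + " -> sleep " + sleepMs + " ms");
    }
  }
}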
Sink processors
A SinkProcessor is the sink processor. So how does the SinkRunner differ from the SinkProcessor? The SinkRunner is essentially responsible for running sinks (it is the first object invoked on the sink startup path, just as a Source has its SourceRunner), while the SinkProcessor decides which sink should pull events from its corresponding channel.
Why is a SinkProcessor needed?
Flume can group sinks into sink groups, and each sink group contains one or more sinks. A sink that is not assigned to any group is treated as belonging to a group of which it is the only member. Flume instantiates one SinkRunner per sink group to run that group (see the Sink component framework diagram).
With the role of the sink processor clear, let's look at the available SinkProcessor implementations. There are two kinds:
1. Implementations based on the abstract class AbstractSinkProcessor. The concrete subclasses are FailoverSinkProcessor and LoadBalancingSinkProcessor, used when a sink group is configured. FailoverSinkProcessor is the failover processor: it picks sinks from the group in priority order and only moves on to the next-highest-priority sink when the current one fails. LoadBalancingSinkProcessor is the load-balancing processor: it selects sinks in either random or round-robin order. (A configuration sketch follows this list.)
2. The default sink processor, DefaultSinkProcessor, which accepts only a single sink and, unlike the first kind, simply passes through the result of process without any extra handling.
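As a rough illustration of when these processors come into play, here is a minimal sink-group configuration sketch in Flume's properties format. The agent, sink, and group names (a1, k1, k2, g1) are placeholders; only the processor-related keys are the point:

# Declare a sink group containing two sinks
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Failover: use the sink with the highest priority first,
# fall back to the next one when it fails
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Alternatively, load balancing with round-robin or random selection:
# a1.sinkgroups.g1.processor.type = load_balance
# a1.sinkgroups.g1.processor.selector = round_robin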
If no sink group is configured, the default is the process method of DefaultSinkProcessor. Since no extra handling is needed, the code is very simple: it directly calls the sink's process method (that is, whatever sink is defined in the configuration; when writing to HDFS, that is the HDFS sink):
@Override
public Status process() throws EventDeliveryException {
  return sink.process();
}
The HDFSEventSink process method
The process method of HDFSEventSink is the core of the Sink component: it implements the sink's transactional handling of events. Every concrete sink must implement its own process method (Flume 1.7 ships with a number of built-in sinks).
In HDFSEventSink.java:
/**
 * Pull events from the channel and send them to HDFS. Each transaction can
 * take up to batchSize events. Find the bucket for each event, ensure the
 * file is open, and serialize the data into the file on HDFS.
 * This method is not thread safe.
 */
public Status process() throws EventDeliveryException {
  // get the channel
  Channel channel = getChannel();
  Transaction transaction = channel.getTransaction();   // obtain or create the Transaction
  List<BucketWriter> writers = Lists.newArrayList();
  transaction.begin();   // begin the transaction
  try {
    int txnEventCount = 0;
    // take up to batchSize events from the channel
    for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
      Event event = channel.take();
      if (event == null) {
        break;
      }

      // reconstruct the path name by substituting place holders
      String realPath = BucketPath.escapeString(filePath, event.getHeaders(),
          timeZone, needRounding, roundUnit, roundValue, useLocalTime);
      String realName = BucketPath.escapeString(fileName, event.getHeaders(),
          timeZone, needRounding, roundUnit, roundValue, useLocalTime);

      String lookupPath = realPath + DIRECTORY_DELIMITER + realName;
      LOG.debug("realPath:" + realPath + " ; realName: " + realName);
      LOG.debug("lookupPath: " + lookupPath);
      /* filePath config:   hdfs.path = /user/portal/tmp/syx/flume-events/%y-%m-%d/%H%M
       * filePrefix config: hdfs.filePrefix = events
       *
       * The debug output added above prints, for example:
       * realPath: /user/portal/tmp/syx/flume-events/16-12-17/2110 ; realName: events
       * lookupPath: /user/portal/tmp/syx/flume-events/16-12-17/2110/events
       */
      BucketWriter bucketWriter;
      HDFSWriter hdfsWriter = null;
      // Callback to remove the reference to the bucket writer from the
      // sfWriters map so that all buffers used by the HDFS file
      // handles are garbage collected.
      WriterCallback closeCallback = new WriterCallback() {
        @Override
        public void run(String bucketPath) {
          LOG.info("Writer callback called.");
          synchronized (sfWritersLock) {
            sfWriters.remove(bucketPath);   // remove the mapping for key bucketPath from sfWriters
          }
        }
      };
      synchronized (sfWritersLock) {
        bucketWriter = sfWriters.get(lookupPath);
        // we haven't seen this file yet, so open it and cache the handle
        if (bucketWriter == null) {
          // choose the HDFSWriter for the fileType, one of: SequenceFile, DataStream or CompressedStream
          hdfsWriter = writerFactory.getWriter(fileType);
          // initializeBucketWriter does what its name says: initialize a BucketWriter
          bucketWriter = initializeBucketWriter(realPath, realName,
              lookupPath, hdfsWriter, closeCallback);
          sfWriters.put(lookupPath, bucketWriter);
        }
      }

      // track the buckets getting written in this transaction
      if (!writers.contains(bucketWriter)) {
        writers.add(bucketWriter);
      }

      // write the data to HDFS
      try {
        bucketWriter.append(event);
      } catch (BucketClosedException ex) {
        LOG.info("Bucket was closed while trying to append, " +
            "reinitializing bucket and writing event.");
        hdfsWriter = writerFactory.getWriter(fileType);
        bucketWriter = initializeBucketWriter(realPath, realName,
            lookupPath, hdfsWriter, closeCallback);   // create a new BucketWriter from the given parameters
        synchronized (sfWritersLock) {
          sfWriters.put(lookupPath, bucketWriter);
        }
        bucketWriter.append(event);
      }
    }

    if (txnEventCount == 0) {
      sinkCounter.incrementBatchEmptyCount();
    } else if (txnEventCount == batchSize) {
      sinkCounter.incrementBatchCompleteCount();
    } else {
      sinkCounter.incrementBatchUnderflowCount();
    }

    // flush all pending buckets before committing the transaction
    for (BucketWriter bucketWriter : writers) {
      bucketWriter.flush();
    }

    transaction.commit();   // commit the transaction

    if (txnEventCount < 1) {
      return Status.BACKOFF;
    } else {
      sinkCounter.addToEventDrainSuccessCount(txnEventCount);
      return Status.READY;
    }
  } catch (IOException eIO) {
    transaction.rollback();   // an exception occurred, roll back the transaction
    LOG.warn("HDFS IO error", eIO);
    return Status.BACKOFF;
  } catch (Throwable th) {
    transaction.rollback();
    LOG.error("process failed", th);
    if (th instanceof Error) {
      throw (Error) th;
    } else {
      throw new EventDeliveryException(th);
    }
  } finally {
    transaction.close();   // close the transaction
  }
}
The main job of this process method is transactional event handling: it begins a transaction, takes events from the corresponding channel, and writes them to HDFS. During the write, a temporary file (suffixed .tmp by default) is created and opened in the directory derived from the path configured in the properties file, and events are appended to it; once a time threshold is reached or the file grows to the configured size, the file is renamed to its final name. The transaction is then committed; if an exception occurs the transaction is rolled back, otherwise the file is simply closed. All file operations in this flow go through HDFS file APIs (open, mkdir, rename, and so on). There are further details around writing to HDFS, such as file compression and Hadoop replica handling, which are not covered here.
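For reference, the rolling behavior described above is driven by the hdfs.* sink properties. Below is a configuration sketch that reuses the hdfs.path and hdfs.filePrefix values quoted in the debug comments of the process() listing; the agent and sink names (a1, k1) are placeholders, and the roll values are example settings rather than defaults:

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/portal/tmp/syx/flume-events/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events

# Roll (close and rename the .tmp file) after 30 s, ~128 MB, or 10000 events,
# whichever comes first
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 10000

# Number of events taken from the channel per transaction (the batchSize used in process())
a1.sinks.k1.hdfs.batchSize = 100

# One of SequenceFile, DataStream or CompressedStream, as selected by writerFactory.getWriter(fileType)
a1.sinks.k1.hdfs.fileType = DataStream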
That concludes the outline of how a sink writes to HDFS.
This post took an in-depth look at Flume's Sink component, in particular the processing flow of HDFSEventSink. Starting from the startup in the Application class, it explained the roles of SinkRunner and SinkProcessor and why a SinkProcessor is needed to manage and schedule sink groups. It then walked through HDFSEventSink's process method, which handles the event transaction: taking events from the channel, writing them to HDFS, and committing or rolling back the transaction. The whole flow relies on HDFS file operations such as creating temporary files, renaming them, and file compression.