Overview
The previous posts analyzed Flume's Source and MemoryChannel components; this one looks at the third major component, the Sink. The Sink component pulls data from a Channel and delivers it to the next Flume agent or to a destination store such as HDFS.
To analyze the Sink, let's first look at the definition of the Sink interface:
public interface Sink extends LifecycleAware, NamedComponent {

  /**
   * Sets the channel the sink will consume from.
   */
  public void setChannel(Channel channel);

  /**
   * Returns the channel associated with this sink.
   */
  public Channel getChannel();

  /**
   * Requests the sink attempt to consume data from the attached channel.
   * Consumption should happen within the scope of a transaction: on successful
   * delivery the transaction should be committed, on failure it should be
   * rolled back.
   * Returns READY if one or more events were successfully delivered,
   * BACKOFF if no data could be retrieved from the channel for delivery.
   * Throws EventDeliveryException on any kind of failure delivering data to
   * the next-hop destination.
   */
  public Status process() throws EventDeliveryException;

  public static enum Status {
    READY, BACKOFF
  }
}
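To make the contract concrete, here is a minimal sketch of a hypothetical custom sink (the class name LoggingSink and its "deliver by logging" behavior are illustrative only, not part of Flume) that follows the usual take-within-a-transaction pattern the interface describes:

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.sink.AbstractSink;

// Hypothetical sink that just prints event bodies; it only illustrates the
// one-transaction-per-process() contract described above.
public class LoggingSink extends AbstractSink {

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      Event event = channel.take();
      if (event == null) {
        // nothing to deliver: commit the empty transaction and ask to back off
        tx.commit();
        return Status.BACKOFF;
      }
      // "deliver" the event; a real sink would write to its destination here
      System.out.println(new String(event.getBody()));
      tx.commit();
      return Status.READY;
    } catch (Throwable t) {
      tx.rollback();
      throw new EventDeliveryException("Failed to deliver event", t);
    } finally {
      tx.close();
    }
  }
}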
Starting from the Application entry class
As the earlier analysis showed, the entry point of a Flume process is the Application class, which starts the Channel, Sink, and Source components in turn. Following the startup code, a Sink component is started through the start method of its SinkRunner: the MonitorRunnable thread calls lifecycleAware.start(), which for a sink resolves to SinkRunner.start(). Looking the method up in Eclipse, the sink-related implementation is as follows:
@Override
public void start() {
  SinkProcessor policy = getPolicy();   // obtain the sink processor
  policy.start();

  runner = new PollingRunner();
  runner.policy = policy;
  runner.counterGroup = counterGroup;
  runner.shouldStop = new AtomicBoolean();   // atomically updated Boolean, defaults to false

  runnerThread = new Thread(runner);
  runnerThread.setName("SinkRunner-PollingRunner-" +
      policy.getClass().getSimpleName());
  runnerThread.start();   // start the polling thread

  lifecycleState = LifecycleState.START;
}
In this start method, a SinkProcessor is first obtained and started. A PollingRunner is then created, its fields (policy, counterGroup, shouldStop) are set, and it is handed to a new thread; starting that thread invokes the runner's run method:
@Override
public void run() {
  logger.debug("Polling sink runner starting");

  while (!shouldStop.get()) {
    try {
      if (policy.process().equals(Sink.Status.BACKOFF)) {
        counterGroup.incrementAndGet("runner.backoffs");
        Thread.sleep(Math.min(
            counterGroup.incrementAndGet("runner.backoffs.consecutive")
            * backoffSleepIncrement, maxBackoffSleep));
      } else {
        counterGroup.set("runner.backoffs.consecutive", 0L);
      }
    } catch (InterruptedException e) {
      ......
    }
  }

  logger.debug("Polling runner exiting. Metrics:{}", counterGroup);
}
The run method loops (until shouldStop is set to true), calling the SinkProcessor's process method for the next stage of processing. Whenever process returns BACKOFF, the runner sleeps before trying again, and the sleep time grows with each consecutive backoff up to a cap.
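The backoff arithmetic is just the Math.min expression in the loop above. A tiny standalone sketch of how the sleep time grows, assuming an increment of 1000 ms and a cap of 5000 ms for backoffSleepIncrement and maxBackoffSleep (these values may differ in your Flume version or configuration):

public class BackoffSleepDemo {
  public static void main(String[] args) {
    long backoffSleepIncrement = 1000L; // assumed value, in milliseconds
    long maxBackoffSleep = 5000L;       // assumed cap, in milliseconds
    for (long consecutive = 1; consecutive <= 8; consecutive++) {
      long sleepMs = Math.min(consecutive * backoffSleepIncrement, maxBackoffSleep);
      // prints 1000, 2000, 3000, 4000, 5000, 5000, 5000, 5000
      System.out.println("consecutive backoffs=" + consecutive + " -> sleep " + sleepMs + " ms");
    }
  }
}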
Sink processors
A SinkProcessor is the sink processor. So how does the SinkRunner differ from the SinkProcessor? The SinkRunner is essentially responsible for running sinks (it is the first object invoked on the sink startup path, just as a Source has its SourceRunner), while the SinkProcessor decides which sink should pull events from its corresponding channel.
Why is a SinkProcessor needed?
Flume can group sinks into sink groups, and each sink group contains one or more sinks. A sink that is not assigned to any group is treated as belonging to a group of which it is the only member. Flume instantiates one SinkRunner per sink group to run that group (see the Sink component framework diagram).
With the role of the sink processor clear, let's look at the available SinkProcessor implementations. There are two kinds:
1. Implementations based on the abstract class AbstractSinkProcessor. The concrete subclasses are FailoverSinkProcessor and LoadBalancingSinkProcessor, used when a sink group is configured. FailoverSinkProcessor is the failover processor: it picks sinks from the group in priority order and only moves on to the next-highest-priority sink when the current one fails. LoadBalancingSinkProcessor is the load-balancing processor: it selects sinks in either random or round-robin order. (A configuration sketch follows this list.)
2. The default sink processor, DefaultSinkProcessor, which accepts only a single sink and, unlike the first kind, simply passes through the result of process without any extra handling.
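As a rough illustration of when these processors come into play, here is a minimal sink-group configuration sketch in Flume's properties format. The agent, sink, and group names (a1, k1, k2, g1) are placeholders; only the processor-related keys are the point:

# Declare a sink group containing two sinks
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Failover: use the sink with the highest priority first,
# fall back to the next one when it fails
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Alternatively, load balancing with round-robin or random selection:
# a1.sinkgroups.g1.processor.type = load_balance
# a1.sinkgroups.g1.processor.selector = round_robin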
If no sink group is configured, the default is the process method of DefaultSinkProcessor. Since no extra handling is needed, the code is very simple: it directly calls the sink's process method (that is, whatever sink is defined in the configuration; when writing to HDFS, that is the HDFS sink):
@Override
public Status process() throws EventDeliveryException {
  return sink.process();
}
The HDFSEventSink process method
The process method of HDFSEventSink is the core of the Sink component: it implements the sink's transactional handling of events. Every concrete sink must implement its own process method (Flume 1.7 ships with a number of built-in sinks).
In HDFSEventSink.java:
/**
 * Pull events from the channel and send them to HDFS. Each transaction can
 * take up to batchSize events. Find the bucket for each event, ensure the
 * file is open, and serialize the data into the file on HDFS.
 * This method is not thread safe.
 */
public Status process() throws EventDeliveryException {
  // get the channel
  Channel channel = getChannel();
  Transaction transaction = channel.getTransaction();   // obtain or create the Transaction
  List<BucketWriter> writers = Lists.newArrayList();
  transaction.begin();   // begin the transaction
  try {
    int txnEventCount = 0;
    // take up to batchSize events from the channel
    for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
      Event event = channel.take();
      if (event == null) {
        break;
      }

      // reconstruct the path name by substituting place holders
      String realPath = BucketPath.escapeString(filePath, event.getHeaders(),
          timeZone, needRounding, roundUnit, roundValue, useLocalTime);
      String realName = BucketPath.escapeString(fileName, event.getHeaders(),
          timeZone, needRounding, roundUnit, roundValue, useLocalTime);

      String lookupPath = realPath + DIRECTORY_DELIMITER + realName;
      LOG.debug("realPath:" + realPath + " ; realName: " + realName);
      LOG.debug("lookupPath: " + lookupPath);
      /* filePath config:   hdfs.path = /user/portal/tmp/syx/flume-events/%y-%m-%d/%H%M
       * filePrefix config: hdfs.filePrefix = events
       *
       * The debug output added above prints, for example:
       * realPath: /user/portal/tmp/syx/flume-events/16-12-17/2110 ; realName: events
       * lookupPath: /user/portal/tmp/syx/flume-events/16-12-17/2110/events
       */
      BucketWriter bucketWriter;
      HDFSWriter hdfsWriter = null;
      // Callback to remove the reference to the bucket writer from the
      // sfWriters map so that all buffers used by the HDFS file
      // handles are garbage collected.
      WriterCallback closeCallback = new WriterCallback() {
        @Override
        public void run(String bucketPath) {
          LOG.info("Writer callback called.");
          synchronized (sfWritersLock) {
            sfWriters.remove(bucketPath);   // remove the mapping for key bucketPath from sfWriters
          }
        }
      };
      synchronized (sfWritersLock) {
        bucketWriter = sfWriters.get(lookupPath);
        // we haven't seen this file yet, so open it and cache the handle
        if (bucketWriter == null) {
          // choose the HDFSWriter for the fileType, one of: SequenceFile, DataStream or CompressedStream
          hdfsWriter = writerFactory.getWriter(fileType);
          // initializeBucketWriter does what its name says: initialize a BucketWriter
          bucketWriter = initializeBucketWriter(realPath, realName,
              lookupPath, hdfsWriter, closeCallback);
          sfWriters.put(lookupPath, bucketWriter);
        }
      }

      // track the buckets getting written in this transaction
      if (!writers.contains(bucketWriter)) {
        writers.add(bucketWriter);
      }

      // write the data to HDFS
      try {
        bucketWriter.append(event);
      } catch (BucketClosedException ex) {
        LOG.info("Bucket was closed while trying to append, " +
            "reinitializing bucket and writing event.");
        hdfsWriter = writerFactory.getWriter(fileType);
        bucketWriter = initializeBucketWriter(realPath, realName,
            lookupPath, hdfsWriter, closeCallback);   // create a new BucketWriter from the given parameters
        synchronized (sfWritersLock) {
          sfWriters.put(lookupPath, bucketWriter);
        }
        bucketWriter.append(event);
      }
    }

    if (txnEventCount == 0) {
      sinkCounter.incrementBatchEmptyCount();
    } else if (txnEventCount == batchSize) {
      sinkCounter.incrementBatchCompleteCount();
    } else {
      sinkCounter.incrementBatchUnderflowCount();
    }

    // flush all pending buckets before committing the transaction
    for (BucketWriter bucketWriter : writers) {
      bucketWriter.flush();
    }

    transaction.commit();   // commit the transaction

    if (txnEventCount < 1) {
      return Status.BACKOFF;
    } else {
      sinkCounter.addToEventDrainSuccessCount(txnEventCount);
      return Status.READY;
    }
  } catch (IOException eIO) {
    transaction.rollback();   // an exception occurred, roll back the transaction
    LOG.warn("HDFS IO error", eIO);
    return Status.BACKOFF;
  } catch (Throwable th) {
    transaction.rollback();
    LOG.error("process failed", th);
    if (th instanceof Error) {
      throw (Error) th;
    } else {
      throw new EventDeliveryException(th);
    }
  } finally {
    transaction.close();   // close the transaction
  }
}
The main job of this process method is transactional event handling: it begins a transaction, takes events from the corresponding channel, and writes them to HDFS. During the write, a temporary file (suffixed .tmp by default) is created and opened in the directory derived from the path configured in the properties file, and events are appended to it; once a time threshold is reached or the file grows to the configured size, the file is renamed to its final name. The transaction is then committed; if an exception occurs the transaction is rolled back, otherwise the file is simply closed. All file operations in this flow go through HDFS file APIs (open, mkdir, rename, and so on). There are further details around writing to HDFS, such as file compression and Hadoop replica handling, which are not covered here.
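For reference, the rolling behavior described above is driven by the hdfs.* sink properties. Below is a configuration sketch that reuses the hdfs.path and hdfs.filePrefix values quoted in the debug comments of the process() listing; the agent and sink names (a1, k1) are placeholders, and the roll values are example settings rather than defaults:

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/portal/tmp/syx/flume-events/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events

# Roll (close and rename the .tmp file) after 30 s, ~128 MB, or 10000 events,
# whichever comes first
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 10000

# Number of events taken from the channel per transaction (the batchSize used in process())
a1.sinks.k1.hdfs.batchSize = 100

# One of SequenceFile, DataStream or CompressedStream, as selected by writerFactory.getWriter(fileType)
a1.sinks.k1.hdfs.fileType = DataStream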
That concludes the outline of how a sink writes to HDFS.
This post took an in-depth look at Flume's Sink component, in particular the processing flow of HDFSEventSink. Starting from the startup in the Application class, it explained the roles of SinkRunner and SinkProcessor and why a SinkProcessor is needed to manage and schedule sink groups. It then walked through HDFSEventSink's process method, which handles the event transaction: taking events from the channel, writing them to HDFS, and committing or rolling back the transaction. The whole flow relies on HDFS file operations such as creating temporary files, renaming them, and file compression.