Flume NG flume-hdfs-sink Source Code Analysis


License: CC 4.0 BY-SA


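This article walks through the configure, start, process, and stop methods of the HDFSEventSink component, including reading configuration parameters from the context, setting the codec, choosing the write format, and the classes involved in writing the different file formats. It also covers the fix for FLUME-1104.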

C1: HDFSEventSink

0. HDFSEventSink.configure(): HDFSEventSink also needs to implement the Configurable interface so that it can process its own configuration settings.

0.1 Read the configuration parameters from the context.

0.2 Set the codec:

codeC = getCodec(codecName);
// TODO : set proper compression type
compType = CompressionType.BLOCK;

0.2.1 getCodec()

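(1) Call CompressionCodecFactory.getCodecClasses(conf) to get the list of available codec classes (codecs).

(2) Test each class with codecMatches(cls, codecName) to find the codec class that corresponds to the codec name codecName.

(3) Instantiate it: codec = cls.newInstance().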

(4)
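The lookup steps above can be sketched as follows. This is a minimal, self-contained illustration rather than the Flume source: SketchCodec and the two stand-in codec classes replace Hadoop's CompressionCodec types, the fixed CODECS list replaces the result of CompressionCodecFactory.getCodecClasses(conf), and the matching rule inside codecMatches (full class name, simple name, or simple name without the "Codec" suffix) is an assumption about what a reasonable match looks like.

```java
import java.util.List;

// Hypothetical stand-ins for Hadoop's CompressionCodec classes.
interface SketchCodec {}
class GzipCodec implements SketchCodec {}
class SnappyCodec implements SketchCodec {}

class CodecLookupSketch {
    // Stand-in for the list returned by CompressionCodecFactory.getCodecClasses(conf).
    static final List<Class<? extends SketchCodec>> CODECS =
            List.of(GzipCodec.class, SnappyCodec.class);

    // Assumed rule: match the full class name, the simple name (ignoring case),
    // or the simple name with a trailing "Codec" removed (ignoring case).
    static boolean codecMatches(Class<? extends SketchCodec> cls, String codecName) {
        String simpleName = cls.getSimpleName();
        if (cls.getName().equals(codecName) || simpleName.equalsIgnoreCase(codecName)) {
            return true;
        }
        if (simpleName.endsWith("Codec")) {
            String prefix = simpleName.substring(0, simpleName.length() - "Codec".length());
            return prefix.equalsIgnoreCase(codecName);
        }
        return false;
    }

    // Steps (1) to (3): scan the available classes, pick the match, instantiate it.
    static SketchCodec getCodec(String codecName) {
        for (Class<? extends SketchCodec> cls : CODECS) {
            if (codecMatches(cls, codecName)) {
                try {
                    // The notes use cls.newInstance(); this call is the
                    // non-deprecated equivalent on current JDKs.
                    return cls.getDeclaredConstructor().newInstance();
                } catch (ReflectiveOperationException e) {
                    throw new IllegalArgumentException(
                            "Unable to instantiate " + cls.getName(), e);
                }
            }
        }
        throw new IllegalArgumentException("Unsupported compression codec " + codecName);
    }

    public static void main(String[] args) {
        System.out.println(getCodec("gzip").getClass().getSimpleName());
    }
}
```

Failing fast with an exception for an unknown name is a choice made for this sketch; it keeps a misconfigured codec from silently producing uncompressed files.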

0.3 Set writeFormat

If writeFormat is null,

then choose the format according to the file type: if fileType is DataStreamType or CompStreamType, set

1. HDFSEventSink.start() method should initialize the sink and bring it to a state where it can forward the events to its next destination.

2. HDFSEventSink.process() method from the Sink interface should do the core processing of extracting events from the channel and forwarding them.

3. HDFSEventSink.stop() method should do the necessary cleanup.
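The configure/start/process/stop sequence can be illustrated with a toy sink. Everything here is a stand-in chosen for the sketch: the channel is a plain Queue of strings, the destination is an in-memory list rather than HDFS, and the "batchSize" key is a hypothetical setting name. Only the overall call order and the READY/BACKOFF status values mirror the Flume sink contract.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class LifecycleSinkSketch {
    enum Status { READY, BACKOFF }

    private final Queue<String> channel;                         // stand-in for a Flume channel
    private final List<String> destination = new ArrayList<>();  // stand-in for HDFS
    private int batchSize = 1;
    private boolean started = false;

    LifecycleSinkSketch(Queue<String> channel) {
        this.channel = channel;
    }

    // Step 0: read settings from a context-like map ("batchSize" is a hypothetical key).
    void configure(Map<String, String> context) {
        batchSize = Integer.parseInt(context.getOrDefault("batchSize", "1"));
    }

    // Step 1: bring the sink to a state where it can forward events.
    void start() {
        started = true;
    }

    // Step 2: take up to batchSize events from the channel and forward them.
    Status process() {
        if (!started) {
            throw new IllegalStateException("sink not started");
        }
        int taken = 0;
        while (taken < batchSize) {
            String event = channel.poll();
            if (event == null) {
                break; // channel is empty
            }
            destination.add(event);
            taken++;
        }
        // Report BACKOFF when there was nothing to do, so a caller can wait before retrying.
        return taken == 0 ? Status.BACKOFF : Status.READY;
    }

    // Step 3: cleanup.
    void stop() {
        started = false;
    }

    List<String> delivered() {
        return destination;
    }
}
```

A real sink wraps the body of process() in a channel transaction; that part is left out here to keep the lifecycle itself visible.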

HDFSEventSink will call

(2) HDFSFormatterFactory

(2.1) HDFSWritableFormatter

(2.2) HDFSTextFormatter

(3) HDFSWriterFactory

(3.1) HDFSSequenceFile

(3.2) HDFSDataStream

(3.3) HDFSCompressedDataStream

(4) BucketWriter

(5) HDFSWriter

FLUME-1104 : HDFS rolls the first file incorrectly

The sink's process() keeps track of the buckets opened during the transaction. At the end of the transaction, we need to flush all the buckets that have pending data. This is required to ensure that data removed from the channel is safely in HDFS at commit.
Currently the files are tracked only when they are created, and they are closed during the cleanup instead of being flushed.

The fix is to track buckets every time they are written to in the current transaction. Also, buckets with pending data should be flushed instead of closed.
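The idea behind the fix can be sketched with in-memory stand-ins. This is not the actual patch: BucketSketch and TransactionFlushSketch are hypothetical classes, and an event is modeled as a two-element array of bucket path and body. The point shown is that a bucket is added to the per-transaction set on every write, including writes to buckets opened by an earlier transaction, and that the end of the transaction flushes those buckets and leaves them open.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a bucket writer; "flushed" plays the role of data safely in HDFS.
class BucketSketch {
    final String path;
    final List<String> pending = new ArrayList<>();
    final List<String> flushed = new ArrayList<>();
    boolean closed = false;

    BucketSketch(String path) {
        this.path = path;
    }

    void append(String event) {
        pending.add(event);
    }

    void flush() {
        flushed.addAll(pending);
        pending.clear();
    }
}

class TransactionFlushSketch {
    // Buckets stay open across transactions.
    private final Map<String, BucketSketch> openBuckets = new LinkedHashMap<>();

    // Each event is {bucketPath, eventBody}. Returns the buckets written in this transaction.
    Set<BucketSketch> processBatch(List<String[]> events) {
        Set<BucketSketch> writtenThisTxn = new LinkedHashSet<>();
        for (String[] event : events) {
            BucketSketch bucket = openBuckets.computeIfAbsent(event[0], BucketSketch::new);
            bucket.append(event[1]);
            // Track on every write, not only when the bucket is first created.
            writtenThisTxn.add(bucket);
        }
        // Before commit: flush the pending data, but do not close the buckets.
        for (BucketSketch bucket : writtenThisTxn) {
            bucket.flush();
        }
        return writtenThisTxn;
    }

    BucketSketch bucket(String path) {
        return openBuckets.get(path);
    }
}
```

If buckets were tracked only at creation, the second transaction writing to an already open bucket would leave its data pending at commit time, which is the situation the issue describes.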