Environment
flume-ng 1.6.0-cdh5.15.1
Problem Description
Flume pulls data from Kafka and writes it to HDFS. The source and channel are outside the scope of this analysis and are ignored here. Part of the sink configuration is as follows:
tier1.sinks.sink1.type=hdfs
tier1.sinks.sink1.channel=channel1
tier1.sinks.sink1.hdfs.path=hdfs://xxx/xxx/day=%Y%m%d
tier1.sinks.sink1.hdfs.filePrefix=xxx.log
tier1.sinks.sink1.hdfs.fileType=DataStream
tier1.sinks.sink1.hdfs.closeTries=3
tier1.sinks.sink1.hdfs.round=true
tier1.sinks.sink1.hdfs.roundValue=10
tier1.sinks.sink1.hdfs.roundUnit=minute
tier1.sinks.sink1.hdfs.rollSize=128000000
tier1.sinks.sink1.hdfs.rollCount=0
tier1.sinks.sink1.hdfs.rollInterval=0
tier1.sinks.sink1.hdfs.idleTimeout=1800
tier1.sinks.sink1.hdfs.callTimeout=7200000
tier1.sinks.sink1.hdfs.threadsPoolSize=10
tier1.sinks.sink1.hdfs.rollTimerPoolSize=10
tier1.sinks.sink1.hdfs.batchSize=10000
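As a side note on how the escape sequences and the round* settings in hdfs.path are resolved: the sketch below is not from the original post, but it calls BucketPath.escapeString with the same argument order used by HDFSEventSink.process() quoted later; the timestamp header value is a made-up example.

import java.util.Calendar;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.formatter.output.BucketPath;

public class PathEscapeDemo {
  public static void main(String[] args) {
    // hdfs.useLocalTimeStamp is not set (defaults to false), so every event
    // must carry a "timestamp" header for the %Y%m%d substitution to work.
    Map<String, String> headers = new HashMap<String, String>();
    headers.put("timestamp", "1545868800000"); // made-up epoch millis

    // round=true, roundValue=10, roundUnit=minute, as in the sink config above.
    // Rounding to 10 minutes does not change a day-level escape like %Y%m%d,
    // but it would affect minute-level escapes.
    String realPath = BucketPath.escapeString(
        "hdfs://xxx/xxx/day=%Y%m%d", headers,
        null,             // timeZone (not configured, so null)
        true,             // needRounding
        Calendar.MINUTE,  // roundUnit
        10,               // roundValue
        false);           // useLocalTimeStamp
    // e.g. hdfs://xxx/xxx/day=20181227 (exact value depends on the JVM time zone)
    System.out.println(realPath);
  }
}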
The permissions of one directory in production were changed, so Flume could no longer write data into that directory, and it kept writing data repeatedly into the previous directory.
The figure above also shows that Flume did not keep consuming data from Kafka normally, which was confirmed by checking the Kafka offsets.
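For completeness, one way to confirm such a permission change on the target directory is the Hadoop FileSystem API. The sketch below is only illustrative and is not from the original post; the path is a placeholder built from the resolved hdfs.path pattern.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckDirPermission {
  public static void main(String[] args) throws Exception {
    // Placeholder path; substitute the directory Flume is writing to.
    Path dir = new Path("hdfs://xxx/xxx/day=20181227");
    FileSystem fs = FileSystem.get(dir.toUri(), new Configuration());
    FileStatus status = fs.getFileStatus(dir);
    // If the owner/permission no longer allows the Flume user to write,
    // the sink's file create/append will typically fail with an
    // AccessControlException.
    System.out.println(status.getPermission() + " " + status.getOwner()
        + ":" + status.getGroup());
  }
}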
Problem Analysis
First, let's look at how Flume's HDFS sink handles consumed data in the normal case. Below is the process() method of the HDFSEventSink class; note also the sfWriters member variable, which caches one BucketWriter per resolved file path:
  private final Object sfWritersLock = new Object();
  private WriterLinkedHashMap sfWriters;

  /**
   * Pull events out of channel and send it to HDFS. Take at most batchSize
   * events per Transaction. Find the corresponding bucket for the event.
   * Ensure the file is open. Serialize the data and write it to the file on
   * HDFS. <br/>
   * This method is not thread safe.
   */
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction transaction = channel.getTransaction();
    List<BucketWriter> writers = Lists.newArrayList();
    transaction.begin();
    try {
      int txnEventCount = 0;
      for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
        Event event = channel.take();
        if (event == null) {
          break;
        }

        // reconstruct the path name by substituting place holders
        String realPath = BucketPath.escapeString(filePath, event.getHeaders(),
            timeZone, needRounding, roundUnit, roundValue, useLocalTime);
        String realName = BucketPath.escapeString(fileName, event.getHeaders(),
            timeZone, needRounding, roundUnit, roundValue, useLocalTime);

        String lookupPath = realPath + DIRECTORY_DELIMITER + realName;
        BucketWriter bucketWriter;
        HDFSWriter hdfsWriter = null;
        // Callback to remove the reference to the bucket writer from the
        // sfWriters map so that all buffers used by the HDFS file
        // handles are garbage collected.
        WriterCallba