Extending Flume's SpoolDirectorySource to monitor for new files

This post walks through the source of Apache Flume's SpoolDirectorySource in order to emit only each file's absolute path rather than its contents. It shows how to configure SpoolDirectorySource, how to modify the source code, and how to build an implementation that watches for new files and reads their paths. A configuration example is provided along with the key code sections, including the event-reading logic and the part that replaces file content with the file path.


Background

Some of our data currently arrives via an FTP service. To speed up loading it into the database, I planned to pair it with Flume for real-time collection. The built-in SpoolDirectorySource can watch a directory for newly added files, but it emits the contents of each file, while all I need is the file's absolute path. So let's crack the source open and take a look.

Source code analysis

The source is as follows:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with this
 * work for additional information regarding copyright ownership. The ASF
 * licenses this file to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */

package unicom.cn.source.spooldir;

import com.google.common.annotations.VisibleForTesting;
import com.google.common.base.Preconditions;
import com.google.common.base.Throwables;
import org.apache.flume.*;
import org.apache.flume.client.avro.ReliableSpoolingFileEventReader;
import org.apache.flume.conf.BatchSizeSupported;
import org.apache.flume.conf.Configurable;
import org.apache.flume.instrumentation.SourceCounter;
import org.apache.flume.serialization.DecodeErrorPolicy;
import org.apache.flume.serialization.LineDeserializer;
import org.apache.flume.source.AbstractSource;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import static org.apache.flume.source.SpoolDirectorySourceConfigurationConstants.*;

public class SpoolDirectorySource extends AbstractSource
    implements Configurable, EventDrivenSource, BatchSizeSupported {

  private static final Logger logger = LoggerFactory.getLogger(SpoolDirectorySource.class);

  /* Config options */
  private String completedSuffix;
  private String spoolDirectory;
  private boolean fileHeader;
  private String fileHeaderKey;
  private boolean basenameHeader;
  private String basenameHeaderKey;
  private int batchSize;
  private String includePattern;
  private String ignorePattern;
  private String trackerDirPath;
  private String deserializerType;
  private Context deserializerContext;
  private String deletePolicy;
  private String inputCharset;
  private DecodeErrorPolicy decodeErrorPolicy;
  private volatile boolean hasFatalError = false;

  private SourceCounter sourceCounter;
  ReliableSpoolingFileEventReader reader;
  private ScheduledExecutorService executor;
  private boolean backoff = true;
  private boolean hitChannelException = false;
  private boolean hitChannelFullException = false;
  private int maxBackoff;
  private ConsumeOrder consumeOrder;
  private int pollDelay;
  private boolean recursiveDirectorySearch;
  private String trackingPolicy;

  @Override
  public synchronized void start() {
    logger.info("SpoolDirectorySource source starting with directory: {}",
        spoolDirectory);

    executor = Executors.newSingleThreadScheduledExecutor();

    File directory = new File(spoolDirectory);
    try {
      reader = new ReliableSpoolingFileEventReader.Builder()
          .spoolDirectory(directory)
          .completedSuffix(completedSuffix)
          .includePattern(includePattern)
          .ignorePattern(ignorePattern)
          .trackerDirPath(trackerDirPath)
          .annotateFileName(fileHeader)
          .fileNameHeader(fileHeaderKey)
          .annotateBaseName(basenameHeader)
          .baseNameHeader(basenameHeaderKey)
          .deserializerType(deserializerType)
          .deserializerContext(deserializerContext)
          .deletePolicy(deletePolicy)
          .inputCharset(inputCharset)
          .decodeErrorPolicy(decodeErrorPolicy)
          .consumeOrder(consumeOrder)
          .recursiveDirectorySearch(recursiveDirectorySearch)
          .trackingPolicy(trackingPolicy)
          .sourceCounter(sourceCounter)
          .build();
    } catch (IOException ioe) {
      throw new FlumeException("Error instantiating spooling event parser",
          ioe);
    }

    Runnable runner = new SpoolDirectoryRunnable(reader, sourceCounter);
    executor.scheduleWithFixedDelay(
        runner, 0, pollDelay, TimeUnit.MILLISECONDS);

    super.start();
    logger.debug("SpoolDirectorySource source started");
    sourceCounter.start();
  }

  @Override
  public synchronized void stop() {
    executor.shutdown();
    try {
      executor.awaitTermination(10L, TimeUnit.SECONDS);
    } catch (InterruptedException ex) {
      logger.info("Interrupted while awaiting termination", ex);
    }
    executor.shutdownNow();

    super.stop();
    sourceCounter.stop();
    logger.info("SpoolDir source {} stopped. Metrics: {}", getName(), sourceCounter);
  }

  @Override
  public String toString() {
    return "Spool Directory source " + getName() +
        ": { spoolDir: " + spoolDirectory + " }";
  }

  @Override
  public synchronized void configure(Context context) {
    spoolDirectory = context.getString(SPOOL_DIRECTORY);
    Preconditions.checkState(spoolDirectory != null,
        "Configuration must specify a spooling directory");

    completedSuffix = context.getString(SPOOLED_FILE_SUFFIX,
        DEFAULT_SPOOLED_FILE_SUFFIX);
    deletePolicy = context.getString(DELETE_POLICY, DEFAULT_DELETE_POLICY);
    fileHeader = context.getBoolean(FILENAME_HEADER,
        DEFAULT_FILE_HEADER);
    fileHeaderKey = context.getString(FILENAME_HEADER_KEY,
        DEFAULT_FILENAME_HEADER_KEY);
    basenameHeader = context.getBoolean(BASENAME_HEADER,
        DEFAULT_BASENAME_HEADER);
    basenameHeaderKey = context.getString(BASENAME_HEADER_KEY,
        DEFAULT_BASENAME_HEADER_KEY);
    batchSize = context.getInteger(BATCH_SIZE,
        DEFAULT_BATCH_SIZE);
    inputCharset = context.getString(INPUT_CHARSET, DEFAULT_INPUT_CHARSET);
    decodeErrorPolicy = DecodeErrorPolicy.valueOf(
        context.getString(DECODE_ERROR_POLICY, DEFAULT_DECODE_ERROR_POLICY)
            .toUpperCase(Locale.ENGLISH));

    includePattern = context.getString(INCLUDE_PAT, DEFAULT_INCLUDE_PAT);
    ignorePattern = context.getString(IGNORE_PAT, DEFAULT_IGNORE_PAT);
    trackerDirPath = context.getString(TRACKER_DIR, DEFAULT_TRACKER_DIR);

    deserializerType = context.getString(DESERIALIZER, DEFAULT_DESERIALIZER);
    deserializerContext = new Context(context.getSubProperties(DESERIALIZER +
        "."));

    consumeOrder = ConsumeOrder.valueOf(context.getString(CONSUME_ORDER,
        DEFAULT_CONSUME_ORDER.toString()).toUpperCase(Locale.ENGLISH));

    pollDelay = context.getInteger(POLL_DELAY, DEFAULT_POLL_DELAY);

    recursiveDirectorySearch = context.getBoolean(RECURSIVE_DIRECTORY_SEARCH,
        DEFAULT_RECURSIVE_DIRECTORY_SEARCH);

    // "Hack" to support backwards compatibility with previous generation of
    // spooling directory source, which did not support deserializers
    Integer bufferMaxLineLength = context.getInteger(BUFFER_MAX_LINE_LENGTH);
    if (bufferMaxLineLength != null && deserializerType != null &&
        deserializerType.equalsIgnoreCase(DEFAULT_DESERIALIZER)) {
      deserializerContext.put(LineDeserializer.MAXLINE_KEY,
          bufferMaxLineLength.toString());
    }

    maxBackoff = context.getInteger(MAX_BACKOFF, DEFAULT_MAX_BACKOFF);
    if (sourceCounter == null) {
      sourceCounter = new SourceCounter(getName());
    }
    trackingPolicy = context.getString(TRACKING_POLICY, DEFAULT_TRACKING_POLICY);
  }

  @VisibleForTesting
  protected boolean hasFatalError() {
    return hasFatalError;
  }


  /**
   * The class always backs off, this exists only so that we can test without
   * taking a really long time.
   *
   * @param backoff - whether the source should backoff if the channel is full
   */
  @VisibleForTesting
  protected void setBackOff(boolean backoff) {
    this.backoff = backoff;
  }

  @VisibleForTesting
  protected boolean didHitChannelException() {
    return hitChannelException;
  }

  @VisibleForTesting
  protected boolean didHitChannelFullException() {
    return hitChannelFullException;
  }

  @VisibleForTesting
  protected SourceCounter getSourceCounter() {
    return sourceCounter;
  }

  @VisibleForTesting
  protected boolean getRecursiveDirectorySearch() {
    return recursiveDirectorySearch;
  }

  @Override
  public long getBatchSize() {
    return batchSize;
  }

  @VisibleForTesting
  protected class SpoolDirectoryRunnable implements Runnable {
    private ReliableSpoolingFileEventReader reader;
    private SourceCounter sourceCounter;

    public SpoolDirectoryRunnable(ReliableSpoolingFileEventReader reader,
                                  SourceCounter sourceCounter) {
      this.reader = reader;
      this.sourceCounter = sourceCounter;
    }

    @Override
    public void run() {
      int backoffInterval = 250;
      boolean readingEvents = false;
      try {
        while (!Thread.interrupted()) {
          readingEvents = true;
          List<Event> events = reader.readEvents(batchSize);
          readingEvents = false;
          if (events.isEmpty()) {
            break;
          }
          sourceCounter.addToEventReceivedCount(events.size());
          sourceCounter.incrementAppendBatchReceivedCount();

          try {
            getChannelProcessor().processEventBatch(events);
            reader.commit();
          } catch (ChannelFullException ex) {
            logger.warn("The channel is full, and cannot write data now. The " +
                "source will try again after " + backoffInterval +
                " milliseconds");
            sourceCounter.incrementChannelWriteFail();
            hitChannelFullException = true;
            backoffInterval = waitAndGetNewBackoffInterval(backoffInterval);
            continue;
          } catch (ChannelException ex) {
            logger.warn("The channel threw an exception, and cannot write data now. The " +
                "source will try again after " + backoffInterval +
                " milliseconds");
            sourceCounter.incrementChannelWriteFail();
            hitChannelException = true;
            backoffInterval = waitAndGetNewBackoffInterval(backoffInterval);
            continue;
          }
          backoffInterval = 250;
          sourceCounter.addToEventAcceptedCount(events.size());
          sourceCounter.incrementAppendBatchAcceptedCount();
        }
      } catch (Throwable t) {
        logger.error("FATAL: " + SpoolDirectorySource.this.toString() + ": " +
            "Uncaught exception in SpoolDirectorySource thread. " +
            "Restart or reconfigure Flume to continue processing.", t);
        if (readingEvents) {
          sourceCounter.incrementEventReadFail();
        } else {
          sourceCounter.incrementGenericProcessingFail();
        }
        hasFatalError = true;
        Throwables.propagate(t);
      }
    }

    private int waitAndGetNewBackoffInterval(int backoffInterval) throws InterruptedException {
      if (backoff) {
        TimeUnit.MILLISECONDS.sleep(backoffInterval);
        backoffInterval = backoffInterval << 1;
        backoffInterval = backoffInterval >= maxBackoff ? maxBackoff :
            backoffInterval;
      }
      return backoffInterval;
    }
  }
}

A few key methods:

  1. public synchronized void configure(Context context) — reads the configuration
  2. public synchronized void start() — starts the monitoring task

Based on the official example:

a1.channels = ch-1
a1.sources = src-1

a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
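
Once the modified source is compiled and its jar is placed on Flume's classpath (e.g. in the lib directory), the same example can point at the custom class by using its fully qualified class name as the type. This is a minimal sketch under that assumption; the channel name and spool directory are simply reused from the example above:

a1.channels = ch-1
a1.sources = src-1

a1.sources.src-1.type = unicom.cn.source.spooldir.SpoolDirectorySource
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true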

Let's hard-code the configuration values, i.e. the same fields that configure(Context) would normally populate:

        spoolDirectory ="D:\\tmp\\test";
        Preconditions.checkState(spoolDirectory != null,
                "Configuration must specify a spooling directory");

        completedSuffix =DEFAULT_SPOOLED_FILE_SUFFIX ;
        deletePolicy =DEFAULT_DELETE_POLICY;
        fileHeader =
                DEFAULT_FILE_HEADER;
        fileHeaderKey =
                DEFAULT_FILENAME_HEADER_KEY;
        basenameHeader =
                DEFAULT_BASENAME_HEADER;
        basenameHeaderKey =
                DEFAULT_BASENAME_HEADER_KEY;
        batchSize =
                DEFAULT_BATCH_SIZE;
        inputCharset = DEFAULT_INPUT_CHARSET;
        decodeErrorPolicy = DecodeErrorPolicy.valueOf(
               DEFAULT_DECODE_ERROR_POLICY
                        .toUpperCase(Locale.ENGLISH));

        includePattern =  DEFAULT_INCLUDE_PAT;
        ignorePattern = DEFAULT_IGNORE_PAT;
        trackerDirPath =  DEFAULT_TRACKER_DIR;

        deserializerType = DEFAULT_DESERIALIZER;
        Map<String, String> paramters=new HashMap<>();
        deserializerContext = new Context(paramters);

        consumeOrder = ConsumeOrder.valueOf(
                DEFAULT_CONSUME_ORDER.toString().toUpperCase(Locale.ENGLISH));

        pollDelay =DEFAULT_POLL_DELAY;

        recursiveDirectorySearch =
                DEFAULT_RECURSIVE_DIRECTORY_SEARCH;

        // "Hack" to support backwards compatibility with previous generation of
        // spooling directory source, which did not support deserializers
//        if (bufferMaxLineLength != null && deserializerType != null &&
//                deserializerType.equalsIgnoreCase(DEFAULT_DESERIALIZER)) {
            deserializerContext.put(LineDeserializer.MAXLINE_KEY,
                   "100");
//        }

        maxBackoff =DEFAULT_MAX_BACKOFF;
        if (sourceCounter == null) {
            sourceCounter = new SourceCounter("test");
        }
        trackingPolicy =DEFAULT_TRACKING_POLICY;


        executor = Executors.newSingleThreadScheduledExecutor();

        File directory = new File(spoolDirectory);
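
With the fields filled in this way, the reader can be driven by hand before wiring it into an agent. The following is only a sketch, not part of the original post: it assumes it runs inside the same class (so the fields above are in scope), that the enclosing method declares throws IOException, and that java.nio.charset.StandardCharsets is imported. It builds the reader the same way start() does and polls it once:

        // Build the reader exactly as start() does, then poll it once by hand.
        ReliableSpoolingFileEventReader testReader = new ReliableSpoolingFileEventReader.Builder()
                .spoolDirectory(directory)
                .completedSuffix(completedSuffix)
                .includePattern(includePattern)
                .ignorePattern(ignorePattern)
                .trackerDirPath(trackerDirPath)
                .annotateFileName(fileHeader)
                .fileNameHeader(fileHeaderKey)
                .annotateBaseName(basenameHeader)
                .baseNameHeader(basenameHeaderKey)
                .deserializerType(deserializerType)
                .deserializerContext(deserializerContext)
                .deletePolicy(deletePolicy)
                .inputCharset(inputCharset)
                .decodeErrorPolicy(decodeErrorPolicy)
                .consumeOrder(consumeOrder)
                .recursiveDirectorySearch(recursiveDirectorySearch)
                .trackingPolicy(trackingPolicy)
                .sourceCounter(sourceCounter)
                .build();

        // Read one batch, print each event body, then mark the batch as committed.
        List<Event> polled = testReader.readEvents(batchSize);
        for (Event e : polled) {
            System.out.println(new String(e.getBody(), StandardCharsets.UTF_8));
        }
        testReader.commit();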

Now for the most important part, the monitoring task itself. The core code is readEvents() in ReliableSpoolingFileEventReader:

public List<Event> readEvents(int numEvents) throws IOException {
    if (!committed) {
      if (!currentFile.isPresent()) {
        throw new IllegalStateException("File should not roll when " +
            "commit is outstanding.");
      }
      logger.info("Last read was never committed - resetting mark position.");
      currentFile.get().getDeserializer().reset();
    } else {
      // Check if new files have arrived since last call
      if (!currentFile.isPresent()) {
        currentFile = getNextFile();
      }
      // Return empty list if no new files
      if (!currentFile.isPresent()) {
        return Collections.emptyList();
      }
    }

    List<Event> events = readDeserializerEvents(numEvents);

    /* It's possible that the last read took us just up to a file boundary.
     * If so, try to roll to the next file, if there is one.
     * Loop until events is not empty or there is no next file in case of 0 byte files */
    while (events.isEmpty()) {
      logger.info("Last read took us just up to a file boundary. " +
                  "Rolling to the next file, if there is one.");
      retireCurrentFile();
      currentFile = getNextFile();
      if (!currentFile.isPresent()) {
        return Collections.emptyList();
      }
      events = readDeserializerEvents(numEvents);
    }

    fillHeader(events);

    committed = false;
    lastFileRead = currentFile;
    return events;
  }

  private List<Event> readDeserializerEvents(int numEvents) throws IOException {
    EventDeserializer des = currentFile.get().getDeserializer();
    List<Event> events = des.readEvents(numEvents);
    if (events.isEmpty() && firstTimeRead) {
      events.add(EventBuilder.withBody(new byte[0]));
    }
    firstTimeRead = false;
    return events;
  }

What we need to do is replace the file content in the returned events with the file name. The modified code:

  public List<Event> readEvents(int numEvents) throws IOException {
    // Check whether a new file has arrived since the last call.
    if (!currentFile.isPresent()) {
      currentFile = getNextFile();
    }
    // Return an empty list if there are no new files.
    if (!currentFile.isPresent()) {
      return Collections.emptyList();
    }

    // Emit a single event whose body is the absolute path of the new file,
    // instead of deserializing the file's contents.
    String filename = currentFile.get().getFile().getAbsolutePath();
    List<Event> events = new ArrayList<>();
    events.add(EventBuilder.withBody(filename.getBytes(StandardCharsets.UTF_8)));

    // Retire the file (rename or delete it per the delete policy) and move on to the next one.
    retireCurrentFile();
    currentFile = getNextFile();
    committed = false;
    lastFileRead = currentFile;
    return events;
  }
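
To sanity-check the change, drop a file into the spool directory and poll the reader once; with the modification above, the event body should be the absolute path rather than the file contents. This is only a hypothetical sketch reusing the testReader built earlier; the file name sample.txt and its content are illustrative, not from the original post:

        // Hypothetical test file; the name and content are illustrative only.
        File spooled = new File(directory, "sample.txt");
        java.nio.file.Files.write(spooled.toPath(),
                "this content should NOT end up in the event".getBytes(StandardCharsets.UTF_8));

        List<Event> result = testReader.readEvents(batchSize);
        // Expected body: the absolute path, e.g. D:\tmp\test\sample.txt
        System.out.println(new String(result.get(0).getBody(), StandardCharsets.UTF_8));
        testReader.commit();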

And we're done.
The full source package is attached.
