Flume Sinks: HDFS Sink

This article takes a close look at the HDFS Sink, the Flume component used to write event data into the Hadoop Distributed File System (HDFS). It covers creating text and sequence files with the HDFS Sink, the supported compression options, and how to use file rolling, data bucketing, and formatting escape sequences. It also explains the configuration parameters and walks through an example, to help the reader manage and organize data on HDFS efficiently.


HDFS Sink

This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types. The files can be rolled (close current file and create a new one) periodically based on the elapsed time, size of data, or number of events. It also buckets/partitions data by attributes like timestamp or the machine where the event originated. The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events. Using this sink requires Hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a version of Hadoop that supports the sync() call is required.
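As a hedged sketch of those rolling triggers (the agent, channel and sink names a1/c1/k1 and the threshold values are illustrative assumptions, not from the original), a sink could be told to roll every ten minutes, at 128 MB, or after 100000 events, whichever comes first:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/
# roll on whichever threshold is reached first
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 100000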

1. Supported escape sequences

The following are the escape sequences supported:

Alias            Description
%{host}          Substitute value of event header named “host”. Arbitrary header names are supported.
%t               Unix time in milliseconds
%a               locale’s short weekday name (Mon, Tue, …)
%A               locale’s full weekday name (Monday, Tuesday, …)
%b               locale’s short month name (Jan, Feb, …)
%B               locale’s long month name (January, February, …)
%c               locale’s date and time (Thu Mar 3 23:05:25 2005)
%d               day of month (01)
%e               day of month without padding (1)
%D               date; same as %m/%d/%y
%H               hour (00…23)
%I               hour (01…12)
%j               day of year (001…366)
%k               hour ( 0…23)
%m               month (01…12)
%n               month without padding (1…12)
%M               minute (00…59)
%p               locale’s equivalent of am or pm
%s               seconds since 1970-01-01 00:00:00 UTC
%S               second (00…59)
%y               last two digits of year (00…99)
%Y               year (2010)
%z               +hhmm numeric timezone (for example, -0400)
%[localhost]     Substitute the hostname of the host where the agent is running
%[IP]            Substitute the IP address of the host where the agent is running
%[FQDN]          Substitute the canonical hostname of the host where the agent is running

Note: The escape strings %[localhost], %[IP] and %[FQDN] all rely on Java’s ability to obtain the hostname, which may fail in some networking environments.
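For example, a path pattern that buckets events by originating host and by day could look like the following sketch (the a1/k1 names and the presence of a “host” event header are assumptions for illustration):

# one directory per host per day, resolved from the event headers
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%{host}/%Y-%m-%d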
The file in use will have the name mangled to include ”.tmp” at the end. Once the file is closed, this extension is removed. This allows excluding partially complete files in the directory. Required properties are marked with an asterisk (*) in the table below.
Note: For all of the time-related escape sequences, a header with the key “timestamp” must exist among the headers of the event (unless hdfs.useLocalTimeStamp is set to true). One way to add this automatically is to use the Timestamp Interceptor, as sketched below.
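A minimal sketch of attaching that interceptor to a source, so every event automatically carries a “timestamp” header (the source name r1 is assumed for illustration; timestamp is the built-in alias for the Timestamp Interceptor):

a1.sources = r1
# stamp each event with the current time as it enters the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp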

Name                      Default        Description
channel *                 –
type *                    –              The component type name, needs to be hdfs
hdfs.path *               –              HDFS directory path (e.g. hdfs://namenode/flume/webdata/)
hdfs.filePrefix           FlumeData      Name prefixed to files created by Flume in the hdfs directory
hdfs.fileSuffix           –              Suffix to append to file (e.g. .avro - NOTE: the period is not automatically added)
hdfs.inUsePrefix          –              Prefix used for temporary files that Flume actively writes into
hdfs.inUseSuffix          .tmp           Suffix used for temporary files that Flume actively writes into
hdfs.rollInterval         30             Number of seconds to wait before rolling the current file (0 = never roll based on time interval)
hdfs.rollSize             1024           File size to trigger a roll, in bytes (0 = never roll based on file size)
hdfs.rollCount            10             Number of events written to a file before it is rolled (0 = never roll based on number of events)
hdfs.idleTimeout          0              Timeout after which inactive files get closed (0 = disable automatic closing of idle files)
hdfs.batchSize            100            Number of events written to a file before it is flushed to HDFS
hdfs.codeC                –              Compression codec. One of: gzip, bzip2, lzo, lzop, snappy
hdfs.fileType             SequenceFile   File format: currently SequenceFile, DataStream or CompressedStream. (1) DataStream will not compress the output file; do not set codeC. (2) CompressedStream requires setting hdfs.codeC to an available codec.
hdfs.maxOpenFiles         5000           Allow only this number of open files. If this number is exceeded, the oldest file is closed.
hdfs.minBlockReplicas     –              Specify the minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath.
hdfs.writeFormat          Writable       Format for sequence file records. One of Text or Writable. Set to Text before creating data files with Flume, otherwise those files cannot be read by either Apache Impala (incubating) or Apache Hive.
hdfs.callTimeout          10000          Number of milliseconds allowed for HDFS operations such as open, write, flush, close. This number should be increased if many HDFS timeout operations are occurring.
hdfs.threadsPoolSize      10             Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
hdfs.rollTimerPoolSize    1              Number of threads per HDFS sink for scheduling timed file rolling
hdfs.kerberosPrincipal    –              Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab       –              Kerberos keytab for accessing secure HDFS
hdfs.proxyUser            –
hdfs.round                false          Should the timestamp be rounded down (if true, affects all time-based escape sequences except %t)
hdfs.roundValue           1              Rounded down to the highest multiple of this (in the unit configured by hdfs.roundUnit), less than the current time.
hdfs.roundUnit            second         The unit of the round-down value - second, minute or hour.
hdfs.timeZone             Local Time     Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles.
hdfs.useLocalTimeStamp    false          Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.
hdfs.closeTries           0              Number of times the sink must try renaming a file after initiating a close attempt. If set to 1, this sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure) and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until it is eventually renamed (there is no limit on the number of attempts). The file may still remain open if the close call fails, but the data will be intact; in this case the file will be closed only after a Flume restart.
hdfs.retryInterval        180            Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the NameNode, so setting this too low can cause a lot of load on the NameNode. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails and may leave the file open or with a ”.tmp” extension.
serializer                TEXT           Other possible options include avro_event or the fully-qualified class name of an implementation of the EventSerializer.Builder interface.
serializer.*              –
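Tying the file-type and codec parameters together, a hedged sketch of a sink that writes gzip-compressed text output might look like this (the names and the .gz suffix are assumptions; per the table above, CompressedStream requires hdfs.codeC to be set):

a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
a1.sinks.k1.hdfs.writeFormat = Text
# the period is not added automatically, so include it in the suffix
a1.sinks.k1.hdfs.fileSuffix = .gz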

Example for an agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

The above configuration will round down the timestamp to the last 10th minute. For example, an event with timestamp 11:54:34 AM, June 12, 2012 will cause the hdfs path to become /flume/events/2012-06-12/1150/00.
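To bucket by whole hours instead, only the rounding unit needs to change (a variant of the sketch above, not from the original):

a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour

The same 11:54:34 AM event would then be written under /flume/events/2012-06-12/1100/00, because the timestamp is rounded down to 11:00:00 before the %H%M/%S escapes are resolved.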
