HDFS Sink
This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files, with compression available for both file types. Files can be rolled (close the current file and create a new one) periodically based on elapsed time, data size, or number of events. The sink can also bucket/partition data by attributes such as timestamp or the machine where the event originated. The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events. Using this sink requires Hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a version of Hadoop that supports the sync() call is required.
Supported Escape Sequences
The following are the escape sequences supported:
Alias | Description |
---|---|
%{host} | Substitute value of event header named “host”. Arbitrary header names are supported. |
%t | Unix time in milliseconds |
%a | locale’s short weekday name (Mon, Tue, …) |
%A | locale’s full weekday name (Monday, Tuesday, …) |
%b | locale’s short month name (Jan, Feb, …) |
%B | locale’s long month name (January, February, …) |
%c | locale’s date and time (Thu Mar 3 23:05:25 2005) |
%d | day of month (01) |
%e | day of month without padding (1) |
%D | date; same as %m/%d/%y |
%H | hour (00…23) |
%I | hour (01…12) |
%j | day of year (001…366) |
%k | hour ( 0…23) |
%m | month (01…12) |
%n | month without padding (1…12) |
%M | minute (00…59) |
%p | locale’s equivalent of am or pm |
%s | seconds since 1970-01-01 00:00:00 UTC |
%S | second (00…59) |
%y | last two digits of year (00…99) |
%Y | year (2010) |
%z | +hhmm numeric timezone (for example, -0400) |
%[localhost] | Substitute the hostname of the host where the agent is running |
%[IP] | Substitute the IP address of the host where the agent is running |
%[FQDN] | Substitute the canonical hostname of the host where the agent is running |
Note: The escape strings %[localhost], %[IP] and %[FQDN] all rely on Java’s ability to obtain the hostname, which may fail in some networking environments.
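As an illustration of combining these sequences, the path below buckets events per originating host and per day. This is only a sketch: the agent/sink names (a1, k1) and the namenode URI are placeholders, and the events are assumed to carry a "host" header (and a "timestamp" header for the date parts, per the note below).

```
# hypothetical layout: one directory per host, per day
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/%{host}/%Y-%m-%d
```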
The file in use will have the name mangled to include ".tmp" at the end. Once the file is closed, this extension is removed. This allows excluding partially complete files in the directory. Required properties are in bold.
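If downstream jobs scan the output directory while Flume is still writing, the in-use markers can be adjusted via hdfs.inUsePrefix and hdfs.inUseSuffix (see the table below). One common convention, shown here only as a sketch, is an underscore prefix, since Hadoop's standard input formats skip files starting with an underscore:

```
# in-progress files become _FlumeData.<epoch>.tmp and are ignored by MapReduce-style readers
a1.sinks.k1.hdfs.inUsePrefix = _
```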
Note: For all of the time-related escape sequences, a header with the key "timestamp" must exist among the headers of the event (unless hdfs.useLocalTimeStamp is set to true). One way to add this automatically is to use the Timestamp Interceptor.
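A minimal sketch of attaching the Timestamp Interceptor to a source (the source name r1 is a placeholder):

```
# stamp every event with the agent's current time on arrival
a1.sources = r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
```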
Name | Default | Description |
---|---|---|
**channel** | – | |
**type** | – | The component type name, needs to be hdfs |
**hdfs.path** | – | HDFS directory path (e.g. hdfs://namenode/flume/webdata/) |
hdfs.filePrefix | FlumeData | Name prefixed to files created by Flume in hdfs directory |
hdfs.fileSuffix | – | Suffix to append to file (e.g. .avro - NOTE: period is not automatically added) |
hdfs.inUsePrefix | – | Prefix used for temporary files that Flume actively writes into |
hdfs.inUseSuffix | .tmp | Suffix used for temporary files that Flume actively writes into |
hdfs.rollInterval | 30 | Number of seconds to wait before rolling current file (0 = never roll based on time interval) |
hdfs.rollSize | 1024 | File size to trigger roll, in bytes (0: never roll based on file size) |
hdfs.rollCount | 10 | Number of events written to file before it rolled (0 = never roll based on number of events) |
hdfs.idleTimeout | 0 | Timeout after which inactive files get closed (0 = disable automatic closing of idle files) |
hdfs.batchSize | 100 | Number of events written to file before it is flushed to HDFS |
hdfs.codeC | – | Compression codec, one of the following: gzip, bzip2, lzo, lzop, snappy |
hdfs.fileType | SequenceFile | File format: currently SequenceFile, DataStream or CompressedStream. (1) DataStream will not compress the output file; do not set codeC. (2) CompressedStream requires hdfs.codeC to be set to an available codec. |
hdfs.maxOpenFiles | 5000 | Allow only this number of open files. If this number is exceeded, the oldest file is closed. |
hdfs.minBlockReplicas | – | Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath. |
hdfs.writeFormat | Writable | Format for sequence file records. One of Text or Writable. Set to Text before creating data files with Flume, otherwise those files cannot be read by either Apache Impala (incubating) or Apache Hive. |
hdfs.callTimeout | 10000 | Number of milliseconds allowed for HDFS operations, such as open, write, flush, close. This number should be increased if many HDFS timeout operations are occurring. |
hdfs.threadsPoolSize | 10 | Number of threads per HDFS sink for HDFS IO ops (open, write, etc.) |
hdfs.rollTimerPoolSize | 1 | Number of threads per HDFS sink for scheduling timed file rolling |
hdfs.kerberosPrincipal | – | Kerberos user principal for accessing secure HDFS |
hdfs.kerberosKeytab | – | Kerberos keytab for accessing secure HDFS |
hdfs.proxyUser | – | |
hdfs.round | false | Should the timestamp be rounded down (if true, affects all time based escape sequences except %t) |
hdfs.roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time. |
hdfs.roundUnit | second | The unit of the round down value - second, minute or hour. |
hdfs.timeZone | Local Time | Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles. |
hdfs.useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences. |
hdfs.closeTries | 0 | Number of times the sink must try renaming a file, after initiating a close attempt. If set to 1, this sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until the file is eventually renamed (there is no limit on the number of times it would try). The file may still remain open if the close call fails but the data will be intact and in this case, the file will be closed only after a Flume restart. |
hdfs.retryInterval | 180 | Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the NameNode, so setting this too low can cause a lot of load on the NameNode. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ".tmp" extension. |
serializer | TEXT | Other possible options include avro_event or the fully-qualified class name of an implementation of the EventSerializer.Builder interface. |
serializer.* | – | |
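To illustrate the interaction between hdfs.fileType and hdfs.codeC described above, a gzip-compressed sink might look like the sketch below; the roll settings are illustrative assumptions, not recommendations:

```
# CompressedStream requires a codec; gzip is assumed here
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
# illustrative rolling: every 10 minutes or 128 MB, never by event count
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
```

Conversely, with hdfs.fileType = DataStream the output is left uncompressed and hdfs.codeC must not be set.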
Example for an agent named a1:
```
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
```
The above configuration will round down the timestamp to the last 10th minute. For example, an event with timestamp 11:54:34 AM, June 12, 2012 will cause the hdfs path to become /flume/events/2012-06-12/1150/00.
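If events cannot be guaranteed to carry a timestamp header, the same layout can fall back to the agent's local clock by adding one line to the example above (a sketch based on the hdfs.useLocalTimeStamp property):

```
# resolve %y/%m/%d/%H%M/%S from the agent's clock instead of the event header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```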