Data Format: SequenceFiles

This article looks at how SequenceFile addresses the small files problem in Hadoop. Packing many small files into a single SequenceFile not only reduces the memory burden on the NameNode but also improves MapReduce job efficiency. SequenceFile also supports compression, with a choice of record compression or block compression to further save storage space.


First we should understand what problems SequenceFile tries to solve, and then how SequenceFile helps to solve them.

In HDFS

  • SequenceFile is one of the solutions to the small files problem in Hadoop.
  • A small file is one that is significantly smaller than the HDFS block size (128 MB by default).
  • Every file, directory, and block in HDFS is represented as an object in the NameNode's memory, and each object occupies about 150 bytes.
  • 10 million files, each using one block, would therefore use about 3 gigabytes of NameNode memory.
  • Scaling up to a billion files is simply not feasible.

In MapReduce

  • Map tasks usually process a block of input at a time (using the default FileInputFormat).

  • The more files there are, the more map tasks are needed, and the job can run much more slowly.

Small file scenarios

  • The files are pieces of a larger logical file.
  • The files are inherently small, for example, images.

These two cases require different solutions.

  • For the first case, write a program to concatenate the small files together (see Nathan Marz’s post about a tool called the Consolidator, which does exactly this).
  • For the second case, some kind of container is needed to group the files together.

Solutions in Hadoop

HAR files

  • HAR files (Hadoop Archives) were introduced to alleviate the pressure that large numbers of files put on the namenode’s memory.
  • HARs are probably best used purely for archival purposes.

SequenceFile

  • The idea behind SequenceFile is to pack many small files into a single larger file.
  • For example, suppose there are 10,000 100KB files. We can write a program to put them into a single SequenceFile, laid out as shown below, using the filename as the key and the file contents as the value (a minimal packing sketch is given after this list).

    [Figure: SequenceFile file layout, with filenames as keys and file contents as values]

  • Some benefits:

    1. Less NameNode memory is needed. Continuing with the example of 10,000 100KB files,
      • Before using SequenceFile, the objects for the 10,000 files occupy about 4.5MB of NameNode RAM.
      • After packing them into a single 1GB SequenceFile spanning 8 HDFS blocks, the corresponding objects occupy only about 3.6KB of NameNode RAM.
    2. SequenceFile is splittable, so it is well suited to MapReduce (see the MapReduce sketch after this list).
    3. SequenceFile supports compression.
  • Supported compression types (the file structure depends on the compression type):

    1. Uncompressed
    2. Record-Compressed: Compresses each record as it’s added to the file.
      [Figure: record-compressed SequenceFile layout]

    3. Block-Compressed
      [Figure: block-compressed SequenceFile layout]

      • Waits until the data reaches the block size before compressing.
      • Block compression provides a better compression ratio than record compression.
      • Block compression is generally the preferred option when using SequenceFile.
      • The block here is unrelated to HDFS or filesystem blocks.
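
To make the packing step and the block-compression option concrete, below is a minimal sketch (not code from the original post) that appends every file in a directory to one block-compressed SequenceFile, with the filename as a Text key and the raw bytes as a BytesWritable value. The class name and the input/output paths are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

// Packs all files under an input directory into one block-compressed SequenceFile.
public class SmallFilePacker {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path inputDir = new Path(args[0]);   // directory of small files (hypothetical path)
    Path outputFile = new Path(args[1]); // destination SequenceFile (hypothetical path)

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(outputFile),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        // block compression is generally preferred over record compression
        SequenceFile.Writer.compression(CompressionType.BLOCK, new DefaultCodec()));
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (!status.isFile()) {
          continue;
        }
        // small files fit comfortably in memory, so read each one fully
        byte[] contents = new byte[(int) status.getLen()];
        try (FSDataInputStream in = fs.open(status.getPath())) {
          in.readFully(contents);
        }
        // filename as key, file contents as value
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}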
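
And because a SequenceFile is splittable, one large packed file can feed many map tasks. The sketch below (again an illustration under the same assumptions, not code from the post) is a map-only job that reads the packed file through SequenceFileInputFormat and emits each filename together with its size in bytes.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only job that lists (filename, size in bytes) for each record in the packed SequenceFile.
public class PackedFileSizes {

  public static class SizeMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {
    @Override
    protected void map(Text filename, BytesWritable contents, Context context)
        throws IOException, InterruptedException {
      context.write(filename, new IntWritable(contents.getLength()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "packed file sizes");
    job.setJarByClass(PackedFileSizes.class);

    // SequenceFile sync markers let the framework split one large file
    // across several map tasks instead of running one task per small file.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(SizeMapper.class);
    job.setNumReduceTasks(0); // map-only
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}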