Data Format: SequenceFiles

This article looks at how SequenceFile addresses the small files problem in Hadoop. Packing many small files into a single SequenceFile not only reduces the memory burden on the NameNode but also improves MapReduce job efficiency. SequenceFile also supports compression, with a choice of record compression or block compression to further save storage space.


First we should understand what problems SequenceFile tries to solve, and then how SequenceFile helps to solve them.

In HDFS

  • SequenceFile is one of the solutions to the small files problem in Hadoop.
  • A small file is one that is significantly smaller than the HDFS block size (128 MB by default).
  • Every file, directory, and block in HDFS is represented as an object in the NameNode's memory, and each object occupies about 150 bytes.
  • 10 million files, each using one block, would therefore use about 3 gigabytes of NameNode memory.
  • Scaling up to a billion files is simply not feasible.

In MapReduce

  • Map tasks usually process a block of input at a time (using the default FileInputFormat).

  • The more files there are, the more map tasks are needed, and the job can run much more slowly.

Small file scenarios

  • The files are pieces of a larger logical file.
  • The files are inherently small, for example, images.

These two cases require different solutions.

  • For the first case, write a program to concatenate the small files together (see Nathan Marz’s post about a tool called the Consolidator, which does exactly this).
  • For the second case, some kind of container is needed to group the files together.

Solutions in Hadoop

HAR files

  • HAR files (Hadoop Archives) were introduced to alleviate the pressure that large numbers of files put on the namenode’s memory.
  • HARs are probably best used purely for archival purposes.

SequenceFile

  • The idea behind SequenceFile is to pack many small files into a single larger file.
  • For example, suppose there are 10,000 100KB files. We can write a program to put them into a single SequenceFile, laid out as shown below, using the filename as the key and the file contents as the value (a minimal packing sketch is given after this list).

    [Figure: SequenceFile file layout, with filenames as keys and file contents as values]

  • Some benefits:

    1. Less NameNode memory is needed. Continuing with the example of 10,000 100KB files,
      • Before using SequenceFile, the objects for the 10,000 files occupy about 4.5MB of NameNode RAM.
      • After packing them into a single 1GB SequenceFile spanning 8 HDFS blocks, the corresponding objects occupy only about 3.6KB of NameNode RAM.
    2. SequenceFile is splittable, so it is well suited to MapReduce (see the MapReduce sketch after this list).
    3. SequenceFile supports compression.
  • Supported compression types (the file structure depends on the compression type):

    1. Uncompressed
    2. Record-Compressed: Compresses each record as it’s added to the file.
      [Figure: record-compressed SequenceFile layout]

    3. Block-Compressed
      [Figure: block-compressed SequenceFile layout]

      • Waits until the data reaches the block size before compressing.
      • Block compression provides a better compression ratio than record compression.
      • Block compression is generally the preferred option when using SequenceFile.
      • The block here is unrelated to HDFS or filesystem blocks.
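
To make the packing step and the block-compression option concrete, below is a minimal sketch (not code from the original post) that appends every file in a directory to one block-compressed SequenceFile, with the filename as a Text key and the raw bytes as a BytesWritable value. The class name and the input/output paths are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

// Packs all files under an input directory into one block-compressed SequenceFile.
public class SmallFilePacker {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path inputDir = new Path(args[0]);   // directory of small files (hypothetical path)
    Path outputFile = new Path(args[1]); // destination SequenceFile (hypothetical path)

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(outputFile),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        // block compression is generally preferred over record compression
        SequenceFile.Writer.compression(CompressionType.BLOCK, new DefaultCodec()));
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (!status.isFile()) {
          continue;
        }
        // small files fit comfortably in memory, so read each one fully
        byte[] contents = new byte[(int) status.getLen()];
        try (FSDataInputStream in = fs.open(status.getPath())) {
          in.readFully(contents);
        }
        // filename as key, file contents as value
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}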
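
And because a SequenceFile is splittable, one large packed file can feed many map tasks. The sketch below (again an illustration under the same assumptions, not code from the post) is a map-only job that reads the packed file through SequenceFileInputFormat and emits each filename together with its size in bytes.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only job that lists (filename, size in bytes) for each record in the packed SequenceFile.
public class PackedFileSizes {

  public static class SizeMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {
    @Override
    protected void map(Text filename, BytesWritable contents, Context context)
        throws IOException, InterruptedException {
      context.write(filename, new IntWritable(contents.getLength()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "packed file sizes");
    job.setJarByClass(PackedFileSizes.class);

    // SequenceFile sync markers let the framework split one large file
    // across several map tasks instead of running one task per small file.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(SizeMapper.class);
    job.setNumReduceTasks(0); // map-only
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}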