The Small Files Problem

本文探讨了Hadoop中小文件带来的问题及其解决方法,包括小文件对HDFS内存和MapReduce任务的影响,以及使用SequenceFile、HAR文件、HBase等技术的应对策略。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

Small files are a big problem in Hadoop — or, at least, they are if the number of questions on the user list on this topicis anything to go by. In this post I’ll look at the problem, and examine some common solutions.

Problems with small files and HDFS

A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.

Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.

Furthermore, HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.

Problems with small files and MapReduce

Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.

There are a couple of features to help alleviate the bookkeeping overhead: task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some JVM startup overhead (see the mapred.job.reuse.jvm.num.tasksproperty), and MultiFileInputSplit which can run more than one split per map.

Why are small files produced?

There are at least two cases

  1. The files are pieces of a larger logical file. Since HDFS has only recently supported appends, a very common pattern for saving unbounded files (e.g. log files) is to write them in chunks into HDFS.
  2. The files are inherently small. Imagine a large corpus of images. Each image is a distinct file, and there is no natural way to combine them into one larger file.

These two cases require different solutions. For the first case, where the file is made up of records, the problem may be avoided by calling HDFS’s sync() method (which came with appends, although see this discussion) every so often to continuously write large files. Alternatively, it’s possible to write a program to concatenate the small files together (see Nathan Marz’s post about a tool called the Consolidator which does exactly this).

For the second case, some kind of container is needed to group the files in some way. Hadoop offers a few options here.

HAR files

Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. To a client using the HAR filesystem nothing has changed: all of the original files are visible and accessible (albeit using a har:// URL). However, the number of files in HDFS has been reduced.
SequenceFile and MapFile File Layouts

Reading through files in a HAR is no more efficient than reading through files in HDFS, and in fact may be slower since each HAR file access requires two index file reads as well as the data file read (see diagram). And although HAR files can be used as input to MapReduce, there is no special magic that allows maps to operate over all the files in the HAR co-resident on a HDFS block. It should be possible to built an input format that can take advantage of the improved locality of files in HARs, but it doesn’t exist yet. Note that MultiFileInputSplit, even with the improvements in HADOOP-4565 to choose files in a split that are node local, will need a seek per small file. It would be interesting to see the performance of this compared to a SequenceFile, say. At the current time HARs are probably best used purely for archival purposes.

Sequence Files

The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs.  Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).

It can be slow to convert existing data into SequenceFiles. However, it is perfectly possible to create a collection of SequenceFiles in parallel. (Stuart Sierra has written a very useful post about converting a tar file into a SequenceFile — tools like this are very useful, and it would be good to see more of them). Going forward it’s best to design your data pipeline to write the data at source direct into a SequenceFile, if possible, rather than writing to small files as an intermediate step.

HAR File Layout
Unlike HAR files there is no way to list all the keys in a SequenceFile, short of reading through the whole file. (MapFiles, which are like SequenceFiles with sorted keys, maintain a partial index, so they can’t list all their keys either — see diagram.)

SequenceFile is rather Java-centric. TFile is designed to be cross-platform, and be a replacement for SequenceFile, but it’s not available yet.

HBase

If you are producing lots of small files, then, depending on the access pattern, a different type of storage might be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles), and is a good choice if you need to do MapReduce style streaming analyses with the occasional random look up. If latency is an issue, then there are lots of other choices — see Richard Jones’ excellent survey of key-value stores.

ann@ann:~$ dpkg -l | grep python3 ii libpython3-dev:amd64 3.12.3-0ubuntu2 amd64 header files and a static library for Python (default) ii libpython3-stdlib:amd64 3.12.3-0ubuntu2 amd64 interactive high-level object-oriented language (default python3 version) ii libpython3.10-minimal:amd64 3.10.4-3 amd64 Minimal subset of the Python language (version 3.10) ii libpython3.10-stdlib:amd64 3.10.4-3 amd64 Interactive high-level object-oriented language (standard library, version 3.10) ii libpython3.12-dev:amd64 3.12.3-1ubuntu0.7 amd64 Header files and a static library for Python (v3.12) ii libpython3.12-minimal:amd64 3.12.3-1ubuntu0.7 amd64 Minimal subset of the Python language (version 3.12) ii libpython3.12-stdlib:amd64 3.12.3-1ubuntu0.7 amd64 Interactive high-level object-oriented language (standard library, version 3.12) ii libpython3.12t64:amd64 3.12.3-1ubuntu0.7 amd64 Shared Python runtime library (version 3.12) ii python3 3.12.3-0ubuntu2 amd64 interactive high-level object-oriented language (default python3 version) ii python3-apport 2.28.1-0ubuntu3.7 all Python 3 library for Apport crash report handling ii python3-apt 2.7.7ubuntu4 amd64 Python 3 interface to libapt-pkg ii python3-aptdaemon 1.1.1+bzr982-0ubuntu44 all Python 3 module for the server and client of aptdaemon ii python3-aptdaemon.gtk3widgets 1.1.1+bzr982-0ubuntu44 all Python 3 GTK+ 3 widgets to run an aptdaemon client ii python3-attr 23.2.0-2 all Attributes without boilerplate (Python 3) ii python3-babel 2.10.3-3build1 all tools for internationalizing Python applications - Python 3.x ii python3-bcrypt 3.2.2-1build1 amd64 password hashing library for Python 3 ii python3-blinker 1.7.0-1 all Fast, simple object-to-object and broadcast signaling (Python3) ii python3-bpfcc 0.29.1+ds-1ubuntu7 all Python 3 wrappers for BPF Compiler Collection (BCC) ii python3-brlapi:amd64 6.6-4ubuntu5 amd64 Braille display access via BRLTTY - Python3 bindings ii python3-cairo 1.25.1-2build2 amd64 Python3 bindings for the Cairo vector graphics library ii python3-certifi 2023.11.17-1 all root certificates for validating SSL certs and verifying TLS hosts (python3) ii python3-cffi-backend:amd64 1.16.0-2build1 amd64 Foreign Function Interface for Python 3 calling C code - runtime ii python3-chardet 5.2.0+dfsg-1 all Universal Character Encoding Detector (Python3) ii python3-click 8.1.6-2 all Wrapper around optparse for command line utilities - Python 3.x ii python3-colorama 0.4.6-4 all Cross-platform colored terminal text in Python - Python 3.x ii python3-commandnotfound 23.04.0 all Python 3 bindings for command-not-found. ii python3-configobj 5.0.8-3 all simple but powerful config file reader and writer for Python 3 ii python3-cryptography 41.0.7-4ubuntu0.1 amd64 Python library exposing cryptographic recipes and primitives (Python 3) ii python3-cups:amd64 2.0.1-5build6 amd64 Python3 bindings for CUPS ii python3-cupshelpers 1.5.18-1ubuntu9 all Python utility modules around the CUPS printing system ii python3-dateutil 2.8.2-3ubuntu1 all powerful extensions to the standard Python 3 datetime module ii python3-dbus 1.3.2-5build3 amd64 simple interprocess messaging system (Python 3 interface) ii python3-debconf 1.5.86ubuntu1 all interact with debconf from Python 3 ii python3-debian 0.1.49ubuntu2 all Python 3 modules to work with Debian-related data formats ii python3-defer 1.0.6-2.1ubuntu1 all Small framework for asynchronous programming (Python 3) ii python3-dev 3.12.3-0ubuntu2 amd64 header files and a static library for Python (default) ii python3-distro 1.9.0-1 all Linux OS platform information API ii python3-distro-info 1.7build1 all information about distributions' releases (Python 3 module) ii python3-distupgrade 1:24.04.26 all manage release upgrades ii python3-fasteners 0.18-2 all provides useful locks - Python 3.x ii python3-gdbm:amd64 3.12.3-0ubuntu1 amd64 GNU dbm database support for Python 3.x ii python3-gi 3.48.2-1 amd64 Python 3 bindings for gobject-introspection libraries ii python3-httplib2 0.20.4-3 all comprehensive HTTP client library written for Python3 ii python3-ibus-1.0 1.5.29-2 all Intelligent Input Bus - introspection overrides for Python (Python 3) ii python3-idna 3.6-2ubuntu0.1 all Python IDNA2008 (RFC 5891) handling (Python 3) ii python3-jinja2 3.1.2-1ubuntu1.3 all small but fast and easy to use stand-alone template engine ii python3-json-pointer 2.0-0ubuntu1 all resolve JSON pointers - Python 3.x ii python3-jsonpatch 1.32-3 all library to apply JSON patches - Python 3.x ii python3-jsonschema 4.10.3-2ubuntu1 all An(other) implementation of JSON Schema (Draft 3, 4, 6, 7) ii python3-jwt 2.7.0-1 all Python 3 implementation of JSON Web Token ii python3-launchpadlib 1.11.0-6 all Launchpad web services client library (Python 3) ii python3-lazr.restfulclient 0.14.6-1 all client for lazr.restful-based web services (Python 3) ii python3-lazr.uri 1.0.6-3 all library for parsing, manipulating, and generating URIs ii python3-louis 3.29.0-1build1 all Python bindings for liblouis ii python3-mako 1.3.2-1 all fast and lightweight templating for the Python 3 platform ii python3-markdown-it 3.0.0-2 all Python port of markdown-it and some its associated plugins ii python3-markupsafe 2.1.5-1build2 amd64 HTML/XHTML/XML string library ii python3-mdurl 0.1.2-1 all Python port of the JavaScript mdurl package rF python3-minimal 3.12.3-0ubuntu2 amd64 minimal subset of the Python language (default python3 version) ii python3-monotonic 1.6-2 all implementation of time.monotonic() - Python 3.x ii python3-nacl 1.5.0-4build1 amd64 Python bindings to libsodium (Python 3) ii python3-netaddr 0.8.0-2ubuntu1 all manipulation of various common network address notations (Python 3) ii python3-netifaces:amd64 0.11.0-2build3 amd64 portable network interface information - Python 3.x ii python3-netplan 1.1.2-2~ubuntu24.04.1 amd64 Declarative network configuration Python bindings ii python3-oauthlib 3.2.2-1 all generic, spec-compliant implementation of OAuth for Python3 ii python3-olefile 0.46-3 all Python module to read/write MS OLE2 files ii python3-paramiko 2.12.0-2ubuntu4.1 all Make ssh v2 connections (Python 3) ii python3-pexpect 4.9-2 all Python 3 module for automating interactive applications ii python3-pil:amd64 10.2.0-1ubuntu1 amd64 Python Imaging Library (Python3) ii python3-pip 24.0+dfsg-1ubuntu1.2 all Python package installer ii python3-pip-whl 24.0+dfsg-1ubuntu1.2 all Python package installer (pip wheel) ii python3-pkg-resources 68.1.2-2ubuntu1.2 all Package Discovery and Resource Access using pkg_resources ii python3-problem-report 2.28.1-0ubuntu3.7 all Python 3 library to handle problem reports ii python3-ptyprocess 0.7.0-5 all Run a subprocess in a pseudo terminal from Python 3 ii python3-pygments 2.17.2+dfsg-1 all syntax highlighting package written in Python 3 ii python3-pyparsing 3.1.1-1 all alternative to creating and executing simple grammars - Python 3.x ii python3-pyrsistent:amd64 0.20.0-1build2 amd64 persistent/functional/immutable data structures for Python ii python3-requests 2.31.0+dfsg-1ubuntu1.1 all elegant and simple HTTP library for Python3, built for human beings ii python3-rich 13.7.1-1 all render rich text, tables, progress bars, syntax highlighting, markdown and more ii python3-serial 3.5-2 all pyserial - module encapsulating access for the serial port ii python3-setuptools 68.1.2-2ubuntu1.2 all Python3 Distutils Enhancements ii python3-setuptools-whl 68.1.2-2ubuntu1.2 all Python Distutils Enhancements (wheel package) ii python3-six 1.16.0-4 all Python 2 and 3 compatibility library ii python3-software-properties 0.99.49.2 all manage the repositories that you install software from ii python3-speechd 0.12.0~rc2-2build3 all Python interface to Speech Dispatcher ii python3-sss 2.9.4-1.1ubuntu6.2 amd64 Python3 module for the System Security Services Daemon ii python3-systemd 235-1build4 amd64 Python 3 bindings for systemd ii python3-typing-extensions 4.10.0-1 all Backported and Experimental Type Hints for Python ii python3-tz 2024.1-2 all Python3 version of the Olson timezone database ii python3-uno 4:24.2.7-0ubuntu0.24.04.4 amd64 Python-UNO bridge ii python3-update-manager 1:24.04.12 all Python 3.x module for update-manager ii python3-urllib3 2.0.7-1ubuntu0.2 all HTTP library with thread-safe connection pooling for Python3 ii python3-wadllib 1.3.6-5 all Python 3 library for navigating WADL files ii python3-wheel 0.42.0-2 all built-package format for Python ii python3-xdg 0.28-2 all Python 3 library to access freedesktop.org standards ii python3-xkit 0.5.0ubuntu6 all library for the manipulation of xorg.conf files (Python 3) ii python3-yaml 6.0.1-2build2 amd64 YAML parser and emitter for Python3 ii python3.10-minimal 3.10.4-3 amd64 Minimal subset of the Python language (version 3.10) ii python3.12 3.12.3-1ubuntu0.7 amd64 Interactive high-level object-oriented language (version 3.12) ii python3.12-dev 3.12.3-1ubuntu0.7 amd64 Header files and a static library for Python (v3.12) ii python3.12-minimal 3.12.3-1ubuntu0.7 amd64 Minimal subset of the Python language (version 3.12) ii python3.12-venv 3.12.3-1ubuntu0.7 amd64 Interactive high-level object-oriented language (pyvenv binary, version 3.12)
07-09
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值