Hadoop's SequenceFile can be used to solve the problem of large numbers of small files (a "small file" here loosely means any file smaller than the HDFS block size). SequenceFile is a binary file format provided by the Hadoop API that serializes <key, value> pairs directly into a file. Small files are commonly merged this way: the file name becomes the key and the file contents become the value, and all pairs are serialized into one large file.
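A minimal sketch of that merge idea (the paths, the BytesWritable value type, and the class name are assumptions for illustration, not from the original):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileMergeSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/test/fish/merged.seq"); // assumed output path
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            // One record per small file: file name as key, raw bytes as value.
            for (FileStatus st : fs.listStatus(new Path("/test/fish/small"))) { // assumed input dir
                if (!st.isFile()) {
                    continue;
                }
                byte[] buf = new byte[(int) st.getLen()];
                FSDataInputStream in = fs.open(st.getPath());
                try {
                    IOUtils.readFully(in, buf, 0, buf.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                writer.append(new Text(st.getPath().getName()), new BytesWritable(buf));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}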
Hadoop Archive (HAR) is another archive format that packs small files into HDFS blocks efficiently; for details, see: Hadoop Archive.
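For reference, an archive is built with the hadoop archive command-line tool; the archive name and paths below are placeholders:

hadoop archive -archiveName files.har -p /test/fish/small /test/fish/har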
Note, however, that a SequenceFile cannot be appended to once written, so it is suited to writing a large batch of small files in a single pass.
SequenceFile compression is governed by CompressionType; see the source:
/**
 * The compression type used to compress key/value pairs in the
 * {@link SequenceFile}.
 * @see SequenceFile.Writer
 */
public static enum CompressionType {
    /** Do not compress records. */
    NONE,   // no compression
    /** Compress values only, each separately. */
    RECORD, // compress each value individually; keys stay uncompressed
    /** Compress sequences of records together in blocks. */
    BLOCK   // compress batches of key/value records together as blocks
}
SequenceFile read/write example:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.hadoop.io.Text;

/**
 * @version 1.0
 * @author Fish
 */
public class SequenceFileWriteDemo {

    private static final String[] DATA = { "fish1", "fish2", "fish3", "fish4" };

    public static void main(String[] args) throws IOException {
        /**
         * Write a SequenceFile.
         */
        String uri = "/test/fish/seq.txt";
        Configuration conf = new Configuration();
        Path path = new Path(uri);
        IntWritable key = new IntWritable();
        Text value = new Text();
        Writer writer = null;
        try {
            /**
             * CompressionType.NONE   - do not compress<br>
             * CompressionType.RECORD - compress values only<br>
             * CompressionType.BLOCK  - compress batches of key/value records as blocks
             */
            writer = SequenceFile.createWriter(conf, Writer.file(path), Writer.keyClass(key.getClass()),
                    Writer.valueClass(value.getClass()), Writer.compression(CompressionType.RECORD));
            // The compression type here is illustrative; any of the three works.
            for (int i = 0; i < DATA.length; i++) {
                key.set(i);         // record number as the key
                value.set(DATA[i]); // content as the value
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
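Reading the file back is symmetric. A minimal read sketch (the class name and output formatting are assumptions; the Reader API calls are standard Hadoop):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile.Reader;
import org.apache.hadoop.io.Text;

public class SequenceFileReadDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path("/test/fish/seq.txt"); // same path the writer used
        Reader reader = null;
        try {
            reader = new Reader(conf, Reader.file(path));
            IntWritable key = new IntWritable();
            Text value = new Text();
            // next() fills key/value and returns false at end of file.
            while (reader.next(key, value)) {
                System.out.println(key.get() + "\t" + value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}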