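SequenceFile is a file-based data structure that is well suited to storing large files. Its defining feature is that it stores data as binary key-value pairs.
1. Writing a SequenceFile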
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
// Writes 100 key-value pairs (IntWritable key, Text value) to a SequenceFile.
public class SequenceFileWriteDemo {

    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        // The same key and value objects are reused for every record.
        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path,
                key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);                  // keys count down from 100
                value.set(DATA[i % DATA.length]);  // values cycle through DATA
                // getLength() is the current file position, i.e. where this record starts.
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
Writing a SequenceFile is straightforward: create a writer instance with SequenceFile.createWriter, then call its append method to write the data as key-value pairs.
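To make the "append binary key-value pairs" idea concrete, here is a minimal sketch in plain Java. It is only a simplified model: the class name KeyValueLogSketch and its record layout (4-byte key, 4-byte value length, UTF-8 value bytes) are my own assumptions for illustration. It is not Hadoop's on-disk format, which also has a header, sync markers and optional compression.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified model of an append-only binary key/value stream.
// Hypothetical layout, NOT the real SequenceFile format.
final class KeyValueLogSketch {

    // Appends one record: 4-byte int key, 4-byte value length, UTF-8 value bytes.
    static void append(DataOutputStream out, int key, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(key);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Writes three records with descending keys, mirroring the demo above.
    static byte[] writeSample() {
        String[] lines = {
            "One, two, buckle my shoe",
            "Three, four, shut the door",
            "Five, six, pick up sticks"
        };
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (int i = 0; i < lines.length; i++) {
                append(out, 100 - i, lines[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the records back in the order they were appended.
    static List<String> readAll(byte[] data) {
        List<String> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int key = in.readInt();
                byte[] bytes = new byte[in.readInt()];
                in.readFully(bytes);
                records.add(key + "\t" + new String(bytes, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        for (String record : readAll(writeSample())) {
            System.out.println(record);
        }
    }
}
```

Because every value carries its own length prefix, a reader can walk the stream record by record without any delimiter, which is the same basic reason a SequenceFile can hold arbitrary binary values.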
An excerpt from the beginning of the SequenceFileWriteDemo output:
$ hadoop SequenceFileWriteDemo numbers.seq
13/11/06 21:48:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/11/06 21:48:39 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/11/06 21:48:39 INFO compress.CodecPool: Got brand-new compressor
[128]  100  One, two, buckle my shoe
[173]  99  Three, four, shut the door
[220]  98  Five, six, pick up sticks
[264]  97  Seven, eight, lay them straight
[314]  96  Nine, ten, a big fat hen
[359]  95  One, two, buckle my shoe
[404]  94  Three, four, shut the door
[451]  93  Five, six, pick up sticks
[495]  92  Seven, eight, lay them straight
[545]  91  Nine, ten, a big fat hen
[590]  90  One, two, buckle my shoe
[635]  89  Three, four, shut the door
2. Reading a SequenceFile
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
// Reads a SequenceFile without knowing its key and value types in advance.
public class SequenceFileReadDemo {

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            // getKeyClass() and getValueClass() return the types that
            // SequenceFile.Reader found in the file, and ReflectionUtils creates
            // an instance of each. This way we never need to know the concrete
            // types: everything is read through Writable. Note that key and
            // value are only instantiated here; they hold no data yet.
            Writable key = (Writable)
                ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable)
                ReflectionUtils.newInstance(reader.getValueClass(), conf);
            long position = reader.getPosition();
            // next() is what actually fills key and value, one record per call.
            // syncSeen() returns true if and only if the preceding call to
            // next() passed a sync marker. Sync markers are not written after
            // every record but only once per stretch of data, so many records
            // go by before one shows up.
            while (reader.next(key, value)) {
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
                position = reader.getPosition(); // beginning of next record
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
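The "instantiate first, fill later" pattern from the comments can be sketched without Hadoop at all. The example below is only an analogy under my own assumptions: ReflectiveInstanceSketch is a hypothetical class, and java.lang.StringBuilder stands in for a Writable whose class name was discovered at run time.

```java
// Analogy for ReflectionUtils.newInstance(): create an empty object from a
// class name known only at run time, then fill it in a separate step.
// ReflectiveInstanceSketch is a made-up name, not part of Hadoop.
final class ReflectiveInstanceSketch {

    // Creates an instance of the named class through its no-argument constructor.
    static Object newInstance(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }

    // Shows that the fresh instance is empty until something fills it,
    // just as key and value are empty until next() is called.
    static String demo() {
        StringBuilder value = (StringBuilder) newInstance("java.lang.StringBuilder");
        int lengthBefore = value.length();
        value.append("One, two, buckle my shoe"); // stands in for reader.next(key, value)
        return lengthBefore + " -> " + value;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The calling code only ever refers to a general type (Object here, Writable in the Hadoop demo), which is why the reader works for any key and value classes stored in the file.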
Output of SequenceFileReadDemo:
$ hadoop SequenceFileReadDemo numbers.seq
13/11/06 21:50:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/11/06 21:50:44 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/11/06 21:50:44 INFO compress.CodecPool: Got brand-new decompressor
[128]  100  One, two, buckle my shoe
[173]  99  Three, four, shut the door
[220]  98  Five, six, pick up sticks
[264]  97  Seven, eight, lay them straight
[314]  96  Nine, ten, a big fat hen
[359]  95  One, two, buckle my shoe
[404]  94  Three, four, shut the door
[451]  93  Five, six, pick up sticks
[495]  92  Seven, eight, lay them straight
[545]  91  Nine, ten, a big fat hen
[590]  90  One, two, buckle my shoe
[635]  89  Three, four, shut the door
[682]  88  Five, six, pick up sticks
[726]  87  Seven, eight, lay them straight
[776]  86  Nine, ten, a big fat hen
[821]  85  One, two, buckle my shoe
[866]  84  Three, four, shut the door
[913]  83  Five, six, pick up sticks
[957]  82  Seven, eight, lay them straight
[1007]  81  Nine, ten, a big fat hen
[1052]  80  One, two, buckle my shoe
[1097]  79  Three, four, shut the door
[1144]  78  Five, six, pick up sticks
[1188]  77  Seven, eight, lay them straight
[1238]  76  Nine, ten, a big fat hen
[1283]  75  One, two, buckle my shoe
[1328]  74  Three, four, shut the door
[1375]  73  Five, six, pick up sticks
[1419]  72  Seven, eight, lay them straight
[1469]  71  Nine, ten, a big fat hen
[1514]  70  One, two, buckle my shoe
[1559]  69  Three, four, shut the door
[1606]  68  Five, six, pick up sticks
[1650]  67  Seven, eight, lay them straight
[1700]  66  Nine, ten, a big fat hen
[1745]  65  One, two, buckle my shoe
[1790]  64  Three, four, shut the door
[1837]  63  Five, six, pick up sticks
[1881]  62  Seven, eight, lay them straight
[1931]  61  Nine, ten, a big fat hen
[1976]  60  One, two, buckle my shoe
[2021*]  59  Three, four, shut the door
[2088]  58  Five, six, pick up sticks
[2132]  57  Seven, eight, lay them straight
[2182]  56  Nine, ten, a big fat hen
[2227]  55  One, two, buckle my shoe
[2272]  54  Three, four, shut the door
[2319]  53  Five, six, pick up sticks
[2363]  52  Seven, eight, lay them straight
[2413]  51  Nine, ten, a big fat hen
[2458]  50  One, two, buckle my shoe
[2503]  49  Three, four, shut the door
[2550]  48  Five, six, pick up sticks
[2594]  47  Seven, eight, lay them straight
[2644]  46  Nine, ten, a big fat hen
[2689]  45  One, two, buckle my shoe
[2734]  44  Three, four, shut the door
[2781]  43  Five, six, pick up sticks
[2825]  42  Seven, eight, lay them straight
[2875]  41  Nine, ten, a big fat hen
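The positions in this output also show how much space a sync marker takes. The short calculation below compares two records that hold the same text: the one starting at 173 (key 99) and the one starting at 2021 (key 59, the record flagged with an asterisk). SyncOverheadSketch is just a hypothetical helper name, and the numbers are copied from the output above.

```java
// Worked arithmetic on the positions printed by SequenceFileReadDemo.
final class SyncOverheadSketch {

    // Bytes occupied by the record that starts at `start`,
    // given the position where the following record starts.
    static long recordSize(long start, long nextStart) {
        return nextStart - start;
    }

    public static void main(String[] args) {
        // Both records store "Three, four, shut the door" with an int key.
        long plain = recordSize(173, 220);      // key 99, no sync marker
        long withSync = recordSize(2021, 2088); // key 59, flagged with '*'
        System.out.println("plain record:     " + plain + " bytes");
        System.out.println("record with sync: " + withSync + " bytes");
        System.out.println("sync overhead:    " + (withSync - plain) + " bytes");
    }
}
```

The difference comes out at 20 bytes, which as far as I know corresponds to a 16-byte sync marker preceded by a 4-byte escape value. It also shows that the first sync marker appears only after roughly 2000 bytes of records, not after every record.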
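Later in this part the book covers MapFile. A MapFile is essentially a sorted SequenceFile with an index, so it consists of two files: a SequenceFile that holds the data and an index file. Writing one therefore works the same way as writing a SequenceFile, except that the keys must be appended in sorted order. Unlike a plain SequenceFile, a MapFile can be read by looking up a given key rather than scanning from the start (the book describes this in detail). Converting a SequenceFile into a MapFile is also simple: once the file is sorted, all that is left is to add an index file.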
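The "sorted data plus index file" idea can be sketched in memory. This is a simplified model under my own assumptions, not the Hadoop implementation: SparseIndexSketch and its methods are hypothetical names, the two lists stand in for the data file, and the TreeMap stands in for the index file, which records only every N-th key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of a sorted data file with a sparse index.
// Hypothetical illustration only, NOT Hadoop's MapFile code.
final class SparseIndexSketch {
    private final List<Integer> keys = new ArrayList<>();   // "data file": keys in sorted order
    private final List<String> values = new ArrayList<>();  // "data file": values, same order
    private final TreeMap<Integer, Integer> index = new TreeMap<>(); // key -> position in data
    private final int interval;                              // index every interval-th record

    SparseIndexSketch(int interval) {
        this.interval = interval;
    }

    // Appends a record; this sketch requires strictly ascending keys.
    void append(int key, String value) {
        if (!keys.isEmpty() && key <= keys.get(keys.size() - 1)) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (keys.size() % interval == 0) {
            index.put(key, keys.size()); // remember where this key lives
        }
        keys.add(key);
        values.add(value);
    }

    // Looks up a key: jump to the nearest indexed key at or below it, then scan forward.
    String get(int key) {
        Map.Entry<Integer, Integer> start = index.floorEntry(key);
        if (start == null) {
            return null; // smaller than the first key
        }
        for (int pos = start.getValue(); pos < keys.size(); pos++) {
            int current = keys.get(pos);
            if (current == key) {
                return values.get(pos);
            }
            if (current > key) {
                break; // passed the spot where the key would be
            }
        }
        return null;
    }

    int indexSize() {
        return index.size();
    }

    // Builds 100 records with keys 1..100 and an index entry every 16 records.
    static SparseIndexSketch sample() {
        String[] lines = {
            "One, two, buckle my shoe",
            "Three, four, shut the door",
            "Five, six, pick up sticks",
            "Seven, eight, lay them straight",
            "Nine, ten, a big fat hen"
        };
        SparseIndexSketch file = new SparseIndexSketch(16);
        for (int i = 0; i < 100; i++) {
            file.append(i + 1, lines[i % lines.length]);
        }
        return file;
    }

    public static void main(String[] args) {
        SparseIndexSketch file = sample();
        System.out.println("index entries: " + file.indexSize());
        System.out.println("get(50): " + file.get(50));
    }
}
```

Keeping only every N-th key in the index is the design trade-off: the index stays small enough to hold in memory, at the cost of a short forward scan after each jump.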
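The above is just a set of annotations I made while reading. For the full details you still need to go to the book, but writing down your own understanding as you read is well worth the effort.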