Hadoop I/O — Serialization (Part 2)
1. Writable Data Types
The Writable classes provide wrappers for every Java primitive type except char (a char can be stored in an IntWritable).
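For example, a Java int can be wrapped in an IntWritable and read back with get() or replaced with set(). A minimal sketch (the class name IntWritableDemo is made up for illustration; the Hadoop client libraries are assumed to be on the classpath):

```java
import org.apache.hadoop.io.IntWritable;

public class IntWritableDemo {
    public static void main(String[] args) {
        // wrap a Java int in a Writable
        IntWritable writable = new IntWritable(163);
        System.out.println("writable.get() is " + writable.get());

        // Writable wrappers are mutable, so the same instance can be reused
        writable.set(42);
        System.out.println("writable.get() is " + writable.get());
    }
}
```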
1.1 Text
Text is a Writable implementation for character sequences encoded in UTF-8; it can be thought of as a wrapper around java.lang.String.
Text has the following characteristics:
(1) It uses an int to store the number of bytes in the encoded string, so its maximum size is 2 GB.
(2) It uses standard UTF-8 (for more on UTF-8, see my other post at https://blog.youkuaiyun.com/hh66__66hh/article/category/8228199). UTF-8 is an encoding scheme that encodes Unicode code points. For example, in the following table the first row gives Unicode code points, the second row their UTF-8 encodings, and the third row their Java representations, i.e. their UTF-16 encodings:
Unicode code point:  U+0041   U+00DF   U+6771     U+10400
UTF-8 encoding:      41       C3 9F    E6 9D B1   F0 90 90 80
Java (UTF-16):       \u0041   \u00DF   \u6771     \uD801\uDC00
Differences between Text and String
Because Text uses UTF-8 while java.lang.String uses UTF-16, the two classes differ in several ways:
(1) Indexing. Text (via its find method) indexes by position within the UTF-8 encoded byte sequence (a byte offset), not by Unicode character and not by Java char code unit; String (via its indexOf method) indexes by position within the UTF-16 encoded char sequence (a char offset, where one char is 2 bytes). For example:
import org.apache.hadoop.io.Text;

public class WritableKinds {
    public static void text() {
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        System.out.println("s.indexOf(\"\\u0041\") is " + s.indexOf("\u0041"));
        System.out.println("s.indexOf(\"\\u00DF\") is " + s.indexOf("\u00DF"));
        System.out.println("s.indexOf(\"\\u6771\") is " + s.indexOf("\u6771"));
        System.out.println("s.indexOf(\"\\uD801\\uDC00\") is " + s.indexOf("\uD801\uDC00"));
        System.out.println("t.find(\"\\u0041\") is " + t.find("\u0041"));
        System.out.println("t.find(\"\\u00DF\") is " + t.find("\u00DF"));
        System.out.println("t.find(\"\\u6771\") is " + t.find("\u6771"));
        System.out.println("t.find(\"\\uD801\\uDC00\") is " + t.find("\uD801\uDC00"));
    }

    public static void main(String[] args) throws Exception {
        text();
    }
}
The output is:
s.indexOf("\u0041") is 0
s.indexOf("\u00DF") is 1
s.indexOf("\u6771") is 2
s.indexOf("\uD801\uDC00") is 3
t.find("\u0041") is 0
t.find("\u00DF") is 1
t.find("\u6771") is 3
t.find("\uD801\uDC00") is 6
Analysis:
The table above lists the UTF-8 and UTF-16 encodings of these code points. From it we can see:
1) U+0041 is encoded in UTF-8 as 41, occupying 1 byte, so t.find("\u00DF") equals the space occupied by U+0041, i.e. 1. In UTF-16 it is \u0041, occupying 1 char, so s.indexOf("\u00DF") likewise equals the space occupied by U+0041, i.e. 1.
2) U+00DF is encoded in UTF-8 as C3 9F, occupying 2 bytes, so t.find("\u6771") equals the space occupied by U+0041 plus that of U+00DF, i.e. 3. In UTF-16 it is \u00DF, occupying 1 char, so s.indexOf("\u6771") equals the space occupied by U+0041 plus that of U+00DF, i.e. 2.
3) U+6771 is encoded in UTF-8 as E6 9D B1, occupying 3 bytes, so t.find("\uD801\uDC00") equals the space occupied by U+0041, U+00DF, and U+6771 together, i.e. 6. In UTF-16 it is \u6771, occupying 1 char, so s.indexOf("\uD801\uDC00") equals the space occupied by U+0041, U+00DF, and U+6771 together, i.e. 3.
(2) Retrieving the data at a given position. Text's charAt and String's charAt also differ:
//Text's charAt method
public int charAt(int position)
//Returns the Unicode scalar value (a 32-bit integer value) for the character at position. Note that this method avoids using the converter or doing String instantiation.
//String's charAt method
public char charAt(int index)
//Returns the char value at the specified index. An index ranges from 0 to length() - 1. The first char value of the sequence is at index 0, the next at index 1, and so on, as for array indexing.
They differ in how the position is specified: Text's charAt takes a byte offset, while String's takes a char offset.
They also differ in return type: Text's charAt returns a Unicode scalar value as a 32-bit int, while String's charAt returns a char.
Here is an example:
import org.apache.hadoop.io.Text;

public class WritableKinds {
    public static void text() {
        int value;
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        value = s.charAt(0);
        System.out.println("s.charAt(0) is \\u" + Integer.toHexString(value));
        value = s.charAt(1);
        System.out.println("s.charAt(1) is \\u" + Integer.toHexString(value));
        value = s.charAt(2);
        System.out.println("s.charAt(2) is \\u" + Integer.toHexString(value));
        value = s.charAt(3);
        System.out.println("s.charAt(3) is \\u" + Integer.toHexString(value));
        value = s.charAt(4);
        System.out.println("s.charAt(4) is \\u" + Integer.toHexString(value));
        value = t.charAt(0);
        System.out.println("t.charAt(0) is \\u" + Integer.toHexString(value));
        value = t.charAt(1);
        System.out.println("t.charAt(1) is \\u" + Integer.toHexString(value));
        value = t.charAt(3);
        System.out.println("t.charAt(3) is \\u" + Integer.toHexString(value));
        value = t.charAt(6);
        System.out.println("t.charAt(6) is \\u" + Integer.toHexString(value));
    }

    public static void main(String[] args) throws Exception {
        text();
    }
}
The output is:
s.charAt(0) is \u41
s.charAt(1) is \udf
s.charAt(2) is \u6771
s.charAt(3) is \ud801
s.charAt(4) is \udc00
t.charAt(0) is \u41
t.charAt(1) is \udf
t.charAt(3) is \u6771
t.charAt(6) is \u10400
(3) The length of a Text is the number of bytes in its UTF-8 encoding, while the length of a String is the number of UTF-16 code units (chars). For example:
import org.apache.hadoop.io.Text;

public class WritableKinds {
    public static void text() {
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        System.out.println("t.getLength() is " + Integer.toString(t.getLength()));
        System.out.println("s.length() is " + Integer.toString(s.length()));
    }

    public static void main(String[] args) throws Exception {
        text();
    }
}
The output is:
t.getLength() is 10
s.length() is 5
Iterating over the characters in a Text
Because Text is indexed by byte offset, you cannot iterate over the Unicode characters in a Text simply by incrementing an index.
(1) The approach is as follows:
A. Wrap the Text object in a java.nio.ByteBuffer
Use ByteBuffer's static wrap method to turn the Text object's bytes into a java.nio.ByteBuffer:
// in java.nio.ByteBuffer
public static ByteBuffer wrap(byte[] array, int offset, int length)
where array is the byte array to back the buffer, offset is the starting position within the array, and length is the number of bytes to include. The Text object's byte array is obtained via its getBytes method:
public byte[] getBytes()
B. Repeatedly call the static Text.bytesToCodePoint method on the buffer. It returns the code point of the next Unicode character in the buffer (as an int) and automatically advances the buffer's position to the next character, returning -1 at the end of the string:
public static int bytesToCodePoint(ByteBuffer bytes)
(2) The implementation:
import java.nio.ByteBuffer;
import org.apache.hadoop.io.Text;

public class WritableKinds {
    public static void TextIterator() {
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
        int cp;
        while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1) {
            System.out.println(Integer.toHexString(cp));
        }
    }

    public static void main(String[] args) throws Exception {
        TextIterator();
    }
}
The output is:
41
df
6771
10400
Mutability of Text
Another difference from String is that Text is mutable: a Text instance can be reused by calling one of its set methods. set is overloaded; note the fourth overload, which copies the contents of another Text into this one and is therefore slightly different from the other three:
//First: set to contain the contents of a string.
public void set(String string)
//Second: set to a UTF-8 byte array.
public void set(byte[] utf8)
//Third: set the Text to a range of bytes.
public void set(byte[] utf8, int start, int len)
//Fourth: copy a Text.
public void set(Text other)
Also note that getBytes().length returns the length of the backing byte array (i.e. the capacity of the Writable object), which does not reflect the size of the data actually stored; use getLength to determine the actual size. For example:
import org.apache.hadoop.io.Text;

public class WritableKinds {
    public static void TextReset() {
        Text t = new Text("evening");
        t.set("night");
        System.out.println("t.getLength() is " + Integer.toString(t.getLength()));
        System.out.println("t.getBytes().length is " + Integer.toString(t.getBytes().length));
        Text t1 = new Text("evening");
        t1.set(new Text("night"));
        System.out.println("t1.getLength() is " + Integer.toString(t1.getLength()));
        System.out.println("t1.getBytes().length is " + Integer.toString(t1.getBytes().length));
    }

    public static void main(String[] args) throws Exception {
        TextReset();
    }
}
The output is:
t.getLength() is 5
t.getBytes().length is 5
t1.getLength() is 5
t1.getBytes().length is 7
1.2 BytesWritable
BytesWritable is a wrapper for an array of binary data. Its serialized format is an integer field (4 bytes) specifying the number of data bytes, followed by the data itself; serializing the two-byte array {3, 5}, for example, yields 000000020305. An example:
import java.io.*;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

public class WritableKinds {
    public static byte[] serialize(Writable writable) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream output = new DataOutputStream(out);
        writable.write(output);
        output.close();
        return out.toByteArray();
    }

    public static void BytesWritableTest() throws Exception {
        byte[] b1 = new byte[] {3, 5};
        BytesWritable b = new BytesWritable(b1);
        byte[] b2 = serialize(b);
        System.out.println("StringUtils.byteToHexString(b1) is " + StringUtils.byteToHexString(b1));
        System.out.println("StringUtils.byteToHexString(b2) is " + StringUtils.byteToHexString(b2));
    }

    public static void main(String[] args) throws Exception {
        BytesWritableTest();
    }
}
BytesWritable is also mutable and can be modified via its set method. As with Text, getBytes().length returns the length of the backing byte array (the capacity of the Writable object), which does not reflect the size of the data actually stored; use getLength to determine the actual size. For example:
import org.apache.hadoop.io.BytesWritable;

public class WritableKinds {
    public static void BytesWritableTest() throws Exception {
        byte[] b1 = new byte[] {3, 5};
        BytesWritable b = new BytesWritable(b1);
        b.setCapacity(20);
        System.out.println("b.getBytes().length is " + String.valueOf(b.getBytes().length));
        System.out.println("b.getLength() is " + String.valueOf(b.getLength()));
    }

    public static void main(String[] args) throws Exception {
        BytesWritableTest();
    }
}
The output is:
b.getBytes().length is 20
b.getLength() is 2
1.3 NullWritable
NullWritable is a special kind of Writable with a zero-length serialization. It serves as a placeholder and can be thought of as a wrapper for Java's null: it neither reads data from the stream nor writes any.
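A minimal sketch illustrating this (it reuses the serialize helper from the examples above; the class name NullWritableDemo is made up, and Hadoop is assumed to be on the classpath):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;

public class NullWritableDemo {
    public static byte[] serialize(Writable writable) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream output = new DataOutputStream(out);
        writable.write(output);
        output.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // NullWritable is a singleton obtained via get(); there is no public constructor
        NullWritable n = NullWritable.get();
        // write() emits nothing, so the serialized length is 0
        System.out.println("serialize(n).length is " + serialize(n).length);
    }
}
```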
1.4 ObjectWritable
ObjectWritable is a general-purpose wrapper for Java primitives, String, enum, Writable, null, or arrays of these types. It is very useful when a single field can hold values of multiple types, but as a general mechanism it writes the name of the wrapped type on every serialization, which wastes a great deal of space. For example:
import java.io.*;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

public class WritableKinds {
    public static byte[] serialize(Writable writable) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream output = new DataOutputStream(out);
        writable.write(output);
        output.close();
        return out.toByteArray();
    }

    public static void ObjectWritableTest() throws Exception {
        ObjectWritable obj = new ObjectWritable(new Text("\u00DF"));
        byte[] bytes = serialize(obj);
        System.out.println("bytes is " + StringUtils.byteToHexString(bytes));
    }

    public static void main(String[] args) throws Exception {
        ObjectWritableTest();
    }
}
The output is:
bytes is 00196f72672e6170616368652e6861646f6f702e696f2e5465787400196f72672e6170616368652e6861646f6f702e696f2e5465787402c39f
As you can see, storing a single Unicode character takes a considerable amount of space: the length-prefixed class name org.apache.hadoop.io.Text appears twice in the output, followed by the actual Text serialization 02c39f.
1.5 GenericWritable
As shown above, ObjectWritable wastes space. When the set of wrapped types is small and known in advance, the types can be kept in a static array; serialization then only needs to write the type's index in that array into the byte stream instead of its class name, which saves a great deal of space. This is exactly what GenericWritable does. To use it, you write your own subclass of GenericWritable and initialize the static type array in that subclass. Here is a concrete example:
import java.io.*;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

class MyWritable extends GenericWritable {
    MyWritable(Writable writable) {
        set(writable); // call the parent class's set method to record the object to be serialized
    }

    // the static CLASSES array holds the types that may be serialized
    public static Class<? extends Writable>[] CLASSES = null;

    static {
        // initialize the CLASSES type array; two types are registered here
        CLASSES = (Class<? extends Writable>[]) new Class[] {BytesWritable.class, Text.class};
    }

    // override the parent's getTypes method to return the static type array
    @Override
    protected Class<? extends Writable>[] getTypes() {
        return CLASSES;
    }
}

public class WritableKinds {
    public static byte[] serialize(Writable writable) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream output = new DataOutputStream(out);
        writable.write(output);
        output.close();
        return out.toByteArray();
    }

    public static void GenericWritableTest() throws Exception {
        Text t = new Text("\u0043\u00DF");
        MyWritable mywritable = new MyWritable(t);
        System.out.println("StringUtils.byteToHexString(serialize(t)) is " + StringUtils.byteToHexString(serialize(t)));
        System.out.println("StringUtils.byteToHexString(serialize(mywritable)) is " + StringUtils.byteToHexString(serialize(mywritable)));
    }

    public static void main(String[] args) throws Exception {
        GenericWritableTest();
    }
}
The output is:
StringUtils.byteToHexString(serialize(t)) is 0343c39f
StringUtils.byteToHexString(serialize(mywritable)) is 010343c39f
Comparing the two serializations, mywritable's output is prefixed with an extra byte, the index 01, i.e. the index of Text in the static type array.
1.6 Writable collection classes
The org.apache.hadoop.io package contains six Writable collection classes: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable. A few of them are described below:
(1) ArrayWritable and TwoDArrayWritable
ArrayWritable and TwoDArrayWritable are Writable implementations of one- and two-dimensional arrays whose elements are all of the same Writable type, specified through the constructor:
ArrayWritable array = new ArrayWritable(Text.class);
Alternatively, a subclass of ArrayWritable or TwoDArrayWritable can fix the element type by calling super in its constructor:
public class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);
    }
}
The toArray method of ArrayWritable and TwoDArrayWritable returns the wrapped array.
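A short sketch of storing Text elements in an ArrayWritable and reading them back via toArray (the class name TextArrayDemo is made up; Hadoop is assumed to be on the classpath):

```java
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TextArrayDemo {
    public static void main(String[] args) {
        // the element type is fixed by the constructor argument
        ArrayWritable array = new ArrayWritable(Text.class);
        array.set(new Writable[] { new Text("hello"), new Text("world") });

        // toArray returns the wrapped array (here a Text[], typed as Object)
        Writable[] values = (Writable[]) array.toArray();
        for (Writable w : values) {
            System.out.println(w.toString());
        }
    }
}
```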
(2) MapWritable
MapWritable is an implementation of java.util.Map<Writable, Writable>; both its keys and its values are serialized. An example:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class WritableKinds {
    public static void MapWritableTest() throws Exception {
        MapWritable map = new MapWritable();
        map.put(new IntWritable(1), new Text("hehe"));
        map.put(new IntWritable(2), new IntWritable(3));
        /* the key and value types are not fixed when the map is populated, so get
           returns a general Writable and the result must be cast to the concrete
           type */
        Text t1 = (Text) map.get(new IntWritable(1));
        IntWritable i1 = (IntWritable) map.get(new IntWritable(2));
        System.out.println("t1.toString() is " + t1.toString());
        System.out.println("i1.get() is " + String.valueOf(i1.get()));
    }

    public static void main(String[] args) throws Exception {
        MapWritableTest();
    }
}
The output is:
t1.toString() is hehe
i1.get() is 3
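To confirm that both keys and values really are serialized, a MapWritable can be round-tripped through its write and readFields methods; a sketch (the class name MapWritableRoundTrip is made up; Hadoop is assumed to be on the classpath):

```java
import java.io.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class MapWritableRoundTrip {
    public static void main(String[] args) throws Exception {
        MapWritable src = new MapWritable();
        src.put(new IntWritable(1), new Text("hehe"));

        // serialize the whole map, keys and values included
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        src.write(new DataOutputStream(out));

        // deserialize into a fresh MapWritable
        MapWritable dst = new MapWritable();
        dst.readFields(new DataInputStream(new ByteArrayInputStream(out.toByteArray())));

        System.out.println("dst value is " + dst.get(new IntWritable(1)));
    }
}
```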
2. Implementing a Custom Writable Type
Here we define a custom Writable type named TextPair, which holds a pair of Text objects and implements the WritableComparable interface. The implementation consists of several parts: the constructors (overloaded to provide several ways of constructing a pair), overrides of Writable's write and readFields methods, a toString method, and an override of WritableComparable's compareTo method:
import java.io.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

class TextPair implements WritableComparable<TextPair> {
    private Text first;
    private Text second;

    public TextPair() {
        first = new Text();
        second = new Text();
    }

    public TextPair(String s1, String s2) {
        first = new Text(s1);
        second = new Text(s2);
    }

    public TextPair(Text t1, Text t2) {
        first = new Text(t1);
        second = new Text(t2);
    }

    public void set(Text t1, Text t2) {
        first = t1;
        second = t2;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public String toString() {
        return first.toString() + '\t' + second.toString();
    }

    @Override
    public int compareTo(TextPair tp) {
        int cmp = first.compareTo(tp.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(tp.second);
    }
}

public class WritableKinds {
    public static byte[] serialize(Writable writable) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream output = new DataOutputStream(out);
        writable.write(output);
        output.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        TextPair t1 = new TextPair("hehe", "haha");
        TextPair t2 = new TextPair("hehe", "ahaha");
        System.out.println("t1.toString() is " + t1.toString());
        System.out.println("t2.toString() is " + t2.toString());
        int cmp = t1.compareTo(t2);
        System.out.println("cmp is " + String.valueOf(cmp));
    }
}
The output is:
t1.toString() is hehe haha
t2.toString() is hehe ahaha
cmp is 7
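The write and readFields methods defined above can be exercised with a serialization round trip: write one TextPair to a byte stream and read it back into a fresh instance. A sketch (the class name TextPairRoundTrip is made up; it assumes the TextPair class above is available on the classpath):

```java
import java.io.*;

public class TextPairRoundTrip {
    public static void main(String[] args) throws Exception {
        TextPair src = new TextPair("hehe", "haha");

        // write serializes both Text fields in order
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        src.write(new DataOutputStream(out));

        // readFields rebuilds the pair from the byte stream
        TextPair dst = new TextPair();
        dst.readFields(new DataInputStream(new ByteArrayInputStream(out.toByteArray())));

        System.out.println("dst.toString() is " + dst.toString());
        // the pair round-trips unchanged, so this comparison yields 0
        System.out.println("src.compareTo(dst) is " + src.compareTo(dst));
    }
}
```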