protobuf is a framework developed by Google for serializing structured data, used for communication protocols, data storage, and so on; it supports C++, Java, Python, and other languages. Hadoop 0.23 and later versions will use protobuf for RPC serialization and deserialization. This post describes an experiment on Hadoop 0.19: using protobuf to serialize/deserialize the return value of an RPC call.
To use protobuf you first need to download and install it. Roughly, after downloading and unpacking the tarball, run the following steps in order:
[protobuf-2.4.1]$ ./configure
[protobuf-2.4.1]$ make
[protobuf-2.4.1]$ make check
[protobuf-2.4.1]$ make install # this step requires root privileges (sudo)
Since Hadoop is written in Java, you also need to build the jar under the java directory, as follows:
[protobuf-2.4.1/java]$ mvn test # Maven needs to be installed locally first
[protobuf-2.4.1/java]$ mvn install
[protobuf-2.4.1/java]$ mvn package # this step generates a jar in the target directory
# named protobuf-java-2.4.1.jar; copy it into Hadoop's lib directory
# so it is available at compile time and run time.
With the local environment set up, the next step is to write a proto file. For convenience, the ClusterStatus class was chosen as the example; the corresponding proto file is as follows:
package mapred;

option java_package = "org.apache.hadoop.mapred";
option java_outer_classname = "ClusterStatusProtos";

message ClusterStatus {
  optional int32 task_trackers = 1;
  optional int32 map_tasks = 2;
  optional int32 reduce_tasks = 3;
  optional int32 max_map_tasks = 4;
  optional int32 max_reduce_tasks = 5;
  enum JTState {
    INITIALIZING = 0;
    RUNNING = 1;
  }
  optional JTState state = 6;
}
Once the proto file is written, compile it with the protoc tool to generate the corresponding Java file; the command line is:
protoc -I=$SRC_DIR --java_out=$DST_DIR $SRC_DIR/my.proto
The key option is --java_out, which specifies the directory into which the generated Java file is placed. Compiling the proto file above produces org/apache/hadoop/mapred/ClusterStatusProtos.java.
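Not part of the original write-up, but as a quick check that code generation worked, the generated class can be exercised directly. Below is a minimal sketch; the demo class name and the sample field values are made up for illustration, and only standard protobuf-java calls (newBuilder, toByteArray, parseFrom) are used:

import org.apache.hadoop.mapred.ClusterStatusProtos;
import org.apache.hadoop.mapred.ClusterStatusProtos.ClusterStatus.JTState;

public class ClusterStatusProtoDemo {
  public static void main(String[] args) throws Exception {
    // Build a message with the generated builder API; the setters mirror the proto field names.
    ClusterStatusProtos.ClusterStatus status =
        ClusterStatusProtos.ClusterStatus.newBuilder()
            .setTaskTrackers(10)          // sample values, for illustration only
            .setMapTasks(5)
            .setReduceTasks(3)
            .setMaxMapTasks(40)
            .setMaxReduceTasks(20)
            .setState(JTState.RUNNING)
            .build();

    // Round-trip: serialize to a byte array and parse it back.
    byte[] bytes = status.toByteArray();
    ClusterStatusProtos.ClusterStatus parsed =
        ClusterStatusProtos.ClusterStatus.parseFrom(bytes);
    System.out.println(parsed.getTaskTrackers());   // prints 10
  }
}

Compile and run it with protobuf-java-2.4.1.jar on the classpath.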
The next step is to modify ClusterStatus.java. To avoid changing how Hadoop serializes objects in ObjectWritable.java, we wrap a ClusterStatusProtos.ClusterStatus.Builder object named status inside the ClusterStatus class, remove all of the class's member variables, and change every method to operate on the status object. The corresponding diff is:
Index: src/mapred/org/apache/hadoop/mapred/ClusterStatus.java
===================================================================
--- src/mapred/org/apache/hadoop/mapred/ClusterStatus.java	(revision 106658)
+++ src/mapred/org/apache/hadoop/mapred/ClusterStatus.java	(working copy)
@@ -21,9 +21,12 @@
 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;
+import java.io.InputStream;
 
 import org.apache.hadoop.io.Writable;
-import org.apache.hadoop.io.WritableUtils;
+import org.apache.hadoop.io.DataOutputOutputStream;
+import org.apache.hadoop.mapred.ClusterStatusProtos;
+import org.apache.hadoop.mapred.ClusterStatusProtos.ClusterStatus.JTState;
 
 /**
  * Status information on the current state of the Map-Reduce cluster.
@@ -50,14 +53,10 @@
  * @see JobClient
  */
 public class ClusterStatus implements Writable {
+
+  ClusterStatusProtos.ClusterStatus.Builder status =
+      ClusterStatusProtos.ClusterStatus.newBuilder();
 
-  private int task_trackers;
-  private int map_tasks;
-  private int reduce_tasks;
-  private int max_map_tasks;
-  private int max_reduce_tasks;
-  private JobTracker.State state;
-
   ClusterStatus() {}
 
   /**
@@ -70,13 +69,13 @@
    * @param state the {@link JobTracker.State} of the <code>JobTracker</code>
    */
   ClusterStatus(int trackers, int maps, int reduces, int maxMaps,
-                int maxReduces, JobTracker.State state) {
-    task_trackers = trackers;
-    map_tasks = maps;
-    reduce_tasks = reduces;
-    max_map_tasks = maxMaps;
-    max_reduce_tasks = maxReduces;
-    this.state = state;
+                int maxReduces, JTState state) {
+    status.setTaskTrackers(trackers)
+          .setMapTasks(maps)
+          .setReduceTasks(reduces)
+          .setMaxMapTasks(maxMaps)
+          .setMaxReduceTasks(maxReduces)
+          .setState(state);
   }
 
   /**
@@ -86,7 +85,7 @@
    * @return the number of task trackers in the cluster.
    */
   public int getTaskTrackers() {
-    return task_trackers;
+    return status.getTaskTrackers();
   }
 
   /**
@@ -95,7 +94,7 @@
    * @return the number of currently running map tasks in the cluster.
    */
   public int getMapTasks() {
-    return map_tasks;
+    return status.getMapTasks();
   }
 
   /**
@@ -104,7 +103,7 @@
    * @return the number of currently running reduce tasks in the cluster.
    */
   public int getReduceTasks() {
-    return reduce_tasks;
+    return status.getReduceTasks();
   }
 
   /**
@@ -113,7 +112,7 @@
    * @return the maximum capacity for running map tasks in the cluster.
    */
   public int getMaxMapTasks() {
-    return max_map_tasks;
+    return status.getMaxMapTasks();
   }
 
   /**
@@ -122,7 +121,7 @@
    * @return the maximum capacity for running reduce tasks in the cluster.
    */
   public int getMaxReduceTasks() {
-    return max_reduce_tasks;
+    return status.getMaxReduceTasks();
  }
 
   /**
@@ -131,26 +130,17 @@
    *
    * @return the current state of the <code>JobTracker</code>.
    */
-  public JobTracker.State getJobTrackerState() {
-    return state;
+  public JTState getJobTrackerState() {
+    return status.getState();
   }
 
   public void write(DataOutput out) throws IOException {
-    out.writeInt(task_trackers);
-    out.writeInt(map_tasks);
-    out.writeInt(reduce_tasks);
-    out.writeInt(max_map_tasks);
-    out.writeInt(max_reduce_tasks);
-    WritableUtils.writeEnum(out, state);
+    status.build().writeDelimitedTo(
+        DataOutputOutputStream.constructOutputStream(out));
   }
 
   public void readFields(DataInput in) throws IOException {
-    task_trackers = in.readInt();
-    map_tasks = in.readInt();
-    reduce_tasks = in.readInt();
-    max_map_tasks = in.readInt();
-    max_reduce_tasks = in.readInt();
-    state = WritableUtils.readEnum(in, JobTracker.State.class);
+    status = ClusterStatusProtos.ClusterStatus
+        .parseDelimitedFrom((InputStream) in).toBuilder();
   }
 }
Note that protobuf only accepts InputStream/OutputStream objects for input and output, while Hadoop's streams are DataOutput/DataInput, so a conversion is needed. For DataOutput to OutputStream, the wrapping approach of DataOutputOutputStream.java from HADOOP-7379 is used; for DataInput to InputStream, a plain cast is used. After adjusting JobTracker.java, LocalJobRunner.java, and a few other files, everything compiles and runs.
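For reference, here is a minimal sketch of what such a DataOutput-to-OutputStream adapter can look like. It is modeled on the DataOutputOutputStream.constructOutputStream usage shown in the diff above and on the idea behind HADOOP-7379; the details may differ from the actual patch:

import java.io.DataOutput;
import java.io.IOException;
import java.io.OutputStream;

public class DataOutputOutputStream extends OutputStream {
  private final DataOutput out;

  private DataOutputOutputStream(DataOutput out) {
    this.out = out;
  }

  // Factory used as DataOutputOutputStream.constructOutputStream(out) in the diff above.
  public static OutputStream constructOutputStream(DataOutput out) {
    // If the DataOutput is already an OutputStream (e.g. a DataOutputStream), reuse it directly.
    if (out instanceof OutputStream) {
      return (OutputStream) out;
    }
    return new DataOutputOutputStream(out);
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
  }
}

On the read side, the cast from DataInput to InputStream only works when the object passed in really is an InputStream (e.g. a DataInputStream); a symmetric DataInput-to-InputStream adapter would be the more robust choice.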
Unfortunately, I did not get to measure the performance benefit after the change.
Possible problems:
Compile-time or runtime errors saying that a class or method cannot be found.
Solutions:
1. Make sure protobuf-java-2.4.1.jar has been copied into Hadoop's lib directory;
2. Do a full rebuild: ant clean; ant.
References:
1. http://code.google.com/apis/protocolbuffers/docs/javatutorial.html
2. https://issues.apache.org/jira/browse/HADOOP-7379