I. Background
1. When processes communicate remotely, they can send each other data of many types, but whatever the type, the data travels over the network as a binary byte sequence. The sender must convert its objects into a byte sequence before transmission; this is called object serialization. The receiver must restore the byte sequence back into objects; this is called object deserialization. (A minimal Java sketch of this round trip follows this list.)
2. Hive's deserialization turns each key/value record into the per-column values of a row in a Hive table.
3. Hive can load data into a table without transforming it first, which saves a great deal of time when processing massive datasets.
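To illustrate the round trip described in point 1, here is a minimal, self-contained Java sketch; the Point class and SerializeDemo are invented for this illustration and are not part of Hive:

import java.io.*;

// A throwaway Serializable type, used only for this demo.
class Point implements Serializable {
    int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

public class SerializeDemo {
    public static void main(String[] args) throws Exception {
        Point p = new Point(3, 4);

        // Serialization: object -> byte sequence.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new ObjectOutputStream(bytes).writeObject(p);

        // Deserialization: byte sequence -> object.
        ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        Point copy = (Point) in.readObject();
        System.out.println(copy.x + "," + copy.y); // prints 3,4
    }
}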
II. Technical Details
1. SerDe is short for Serialize/Deserialize; its purpose is serialization and deserialization.
2. When creating a table, a user can supply a custom SerDe or use one of Hive's built-in SerDes. The SerDe specifies the table's columns and maps the data onto those columns. The relevant CREATE TABLE syntax is:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

To create a table with a specific SerDe, use the ROW FORMAT row_format clause. For example:
a) Add the jar. In the Hive CLI:
hive> add jar /run/serde_test.jar;
or from the Linux shell:
${HIVE_HOME}/bin/hive -auxpath /run/serde_test.jar
b) Create the table:
create table serde_table row format serde 'hive.connect.TestDeserializer';
3. Write the deserialization class TestDeserializer, implementing the three methods of the Deserializer interface:
a) Initialization: initialize(Configuration conf, Properties tbl).
b) Deserialize a Writable and return an Object: deserialize(Writable blob).
c) Get the ObjectInspector for the Object returned by deserialize(Writable blob): getObjectInspector().
The interface is defined as follows:
package com.sina.hive.test;

import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.io.Writable;

public interface Deserializer {
    /**
     * Initialize the Hive Deserializer.
     * @param conf System properties
     * @param tbl table properties
     * @throws SerDeException
     */
    public void initialize(Configuration conf, Properties tbl)
            throws SerDeException;

    /**
     * Deserialize an object out of a Writable blob. In most cases, the
     * return value of this function will be constant since the function
     * will reuse the returned object. If the client wants to keep a copy
     * of the object, the client needs to clone the returned value by
     * calling ObjectInspectorUtils.getStandardObject().
     * @param blob The Writable object containing a serialized object
     * @return A Java object representing the contents in the blob.
     */
    public Object deserialize(Writable blob) throws SerDeException;

    /**
     * Get the object inspector that can be used to navigate through the
     * internal structure of the Object returned from deserialize(...).
     */
    public ObjectInspector getObjectInspector() throws SerDeException;
}
The following class deserializes one line of data into the four Hive table columns time, userid, host, and path. For example:
package com.sina.hive.test;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TestDeserializer implements Deserializer {
    private static List<String> FieldNames = new ArrayList<String>();
    private static List<ObjectInspector> FieldNamesObjectInspectors =
            new ArrayList<ObjectInspector>();

    // Register the four output columns and an ObjectInspector for each.
    static {
        FieldNames.add("time");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(Long.class,
                        ObjectInspectorOptions.JAVA));
        FieldNames.add("userid");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(Integer.class,
                        ObjectInspectorOptions.JAVA));
        FieldNames.add("host");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(String.class,
                        ObjectInspectorOptions.JAVA));
        FieldNames.add("path");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(String.class,
                        ObjectInspectorOptions.JAVA));
    }

    @Override
    public Object deserialize(Writable blob) {
        try {
            if (blob instanceof Text) {
                String line = ((Text) blob).toString();
                if (line == null)
                    return null;
                // Each input row is "time\tuserid\turl"; reject rows that
                // do not have exactly three tab-separated fields.
                String[] field = line.split("\t");
                if (field.length != 3) {
                    return null;
                }
                List<Object> result = new ArrayList<Object>();
                // Split the URL into its host and path components.
                URL url = new URL(field[2]);
                Long time = Long.valueOf(field[0]);
                Integer userid = Integer.valueOf(field[1]);
                result.add(time);
                result.add(userid);
                result.add(url.getHost());
                result.add(url.getPath());
                return result;
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        return null;
    }

    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        return ObjectInspectorFactory.getStandardStructObjectInspector(
                FieldNames, FieldNamesObjectInspectors);
    }

    @Override
    public void initialize(Configuration arg0, Properties arg1)
            throws SerDeException {
    }
}
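To sanity-check the class outside of Hive, you can call deserialize on a Text row directly. The driver below is a minimal sketch; the class name TestDeserializerDriver is invented here, and it assumes the Hadoop and Hive jars are on the classpath:

package com.sina.hive.test;

import java.util.List;
import org.apache.hadoop.io.Text;

public class TestDeserializerDriver {
    public static void main(String[] args) throws Exception {
        TestDeserializer serde = new TestDeserializer();
        serde.initialize(null, null); // no-op in this implementation

        // One tab-separated input row: time, userid, url.
        Text row = new Text(
            "1234567891012\t123456\thttp://wiki.apache.org/hadoop/Hive/LanguageManual/UDF");
        List<?> fields = (List<?>) serde.deserialize(row);

        // Expected: [1234567891012, 123456, wiki.apache.org, /hadoop/Hive/LanguageManual/UDF]
        System.out.println(fields);
    }
}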
Test against the Hive table data on HDFS. Here is one line of test data (the three fields are separated by tabs):
1234567891012    123456    http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF
hive> add jar /run/jar/merg_hua.jar;
Added /run/jar/merg_hua.jar to class path
hive> create table serde_table row format serde 'hive.connect.TestDeserializer';
Found class for hive.connect.TestDeserializer
OK
Time taken: 0.028 seconds
hive> describe serde_table;
OK
time      bigint    from deserializer
userid    int       from deserializer
host      string    from deserializer
path      string    from deserializer
Time taken: 0.042 seconds
hive> select * from serde_table;
OK
1234567891012    123456    wiki.apache.org    /hadoop/Hive/LanguageManual/UDF
Time taken: 0.039 seconds

III. Summary
1. To use custom deserialization for a Hive table, write your own class implementing the Deserializer interface and select it through the ROW FORMAT clause of the CREATE TABLE command.
2. When processing massive datasets whose format already matches the table schema, Hive's deserialization lets you load the data without transforming it first, saving a great deal of time.