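Hive Notes Series, Part 5: An Overview of SerDe in Hive 0.5

This article explains how deserialization works in Hive and how to apply it, including how a custom deserializer class lets data be loaded and processed efficiently. The worked example splits each record into time, user ID, host name, and path fields.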


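I. Background

1. When processes communicate remotely, they can send each other data of any type, but whatever the type, it travels over the network as a binary sequence. The sender must convert an object into a byte sequence before it can be transmitted, which is called object serialization; the receiver must restore the byte sequence to an object, which is called object deserialization.

2. In Hive, deserialization turns each key/value record into the column values of one row of a Hive table.

3. Hive can load data into a table without converting it first, because the SerDe interprets the raw records when they are read. This saves a great deal of time when processing massive data sets.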
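Point 1 above can be illustrated with a minimal round trip. This is only a sketch: the `VisitRecord` class and its fields are hypothetical, and it uses plain `java.io` data streams rather than Hadoop's `Writable` types, so it runs without any Hive or Hadoop jars.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical record type used only to illustrate serialization. */
final class VisitRecord {
    final long time;
    final int userId;
    final String url;

    VisitRecord(long time, int userId, String url) {
        this.time = time;
        this.userId = userId;
        this.url = url;
    }

    /** Serialization: object to byte sequence (what the sender does). */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeLong(time);
            out.writeInt(userId);
            out.writeUTF(url);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Deserialization: byte sequence back to object (what the receiver does). */
    static VisitRecord fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            // Fields must be read in the same order they were written.
            return new VisitRecord(in.readLong(), in.readInt(), in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        VisitRecord sent = new VisitRecord(1234567891012L, 123456, "http://wiki.apache.org/hadoop/Hive");
        VisitRecord received = fromBytes(sent.toBytes());
        System.out.println(received.time + " " + received.userId + " " + received.url);
    }
}
```

The receiver only recovers the object because both sides agree on the field order and types; a SerDe plays the same role between stored records and table rows.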

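II. Technical Details

1. SerDe is short for Serializer/Deserializer; its job is to serialize and deserialize records.

2. When creating a table, users can either supply a custom SerDe or use one that ships with Hive. The SerDe can define the table's columns and the data type of each column.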

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

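To create a table with a specific SerDe, use the ROW FORMAT row_format clause. For example:

a. Add the jar. In the Hive client, enter:
   hive> add jar /run/serde_test.jar;
   or start Hive from the Linux shell with:
   ${HIVE_HOME}/bin/hive --auxpath /run/serde_test.jar

b. Create the table:
   create table serde_table row format serde 'hive.connect.TestDeserializer';
   The name after serde must be the fully qualified class name of the deserializer packaged in the jar.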
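The class named in ROW FORMAT SERDE has to be found on the classpath by its full name, which is why the jar is added first. The sketch below uses plain Java reflection as an analogy for that lookup; it is not Hive's actual loading code, and `SerDeClassLookup` and the missing class name are hypothetical.

```java
/** Illustrates resolving a class from a fully qualified name, as an analogy only. */
final class SerDeClassLookup {

    /** Loads the named class and calls its no-argument constructor. */
    static Object instantiate(String className) throws ReflectiveOperationException {
        Class<?> type = Class.forName(className);
        return type.getDeclaredConstructor().newInstance();
    }

    /** Returns true if the name resolves to an instantiable class on the classpath. */
    static boolean isLoadable(String className) {
        try {
            instantiate(className);
            return true;
        } catch (ReflectiveOperationException e) {
            // Covers ClassNotFoundException, NoSuchMethodException, and similar failures.
            return false;
        }
    }

    public static void main(String[] args) {
        // A fully qualified name that exists on every classpath.
        System.out.println(isLoadable("java.util.ArrayList"));
        // The simple name alone is not enough: the package is part of the name.
        System.out.println(isLoadable("ArrayList"));
        // A made-up name standing in for a jar that was never added.
        System.out.println(isLoadable("hive.connect.NoSuchDeserializer"));
    }
}
```

In practice this means the package declared in the deserializer's source file and the name used in the create table statement have to agree.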

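3. Write the deserializer class TestDeserializer. It implements the three methods of the Deserializer interface:

a) Initialization: initialize(Configuration conf, Properties tbl).

b) Deserialize a Writable into an Object: deserialize(Writable blob).

c) Return the ObjectInspector for the Object that deserialize(Writable blob) returns: getObjectInspector().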


package org.apache.hadoop.hive.serde2;

import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.io.Writable;

public interface Deserializer {
	/**
	 * Initialize the HiveDeserializer.
	 *
	 * @param conf System properties
	 * @param tbl table properties
	 * @throws SerDeException
	 */
	public void initialize(Configuration conf, Properties tbl)
			throws SerDeException;

	/**
	 * Deserialize an object out of a Writable blob. In most cases, the
	 * return value of this function will be constant since the function
	 * will reuse the returned object. If the client wants to keep a copy of
	 * the object, the client needs to clone the returned value by calling
	 * ObjectInspectorUtils.getStandardObject().
	 *
	 * @param blob The Writable object containing a serialized object
	 * @return A Java object representing the contents in the blob.
	 */
	public Object deserialize(Writable blob) throws SerDeException;

	/**
	 * Get the object inspector that can be used to navigate through the
	 * internal structure of the Object returned from deserialize(...).
	 */
	public ObjectInspector getObjectInspector() throws SerDeException;
}
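The Javadoc above notes that deserialize may return the same object on every call, so a caller that wants to keep rows must copy them. The sketch below shows that pitfall with a simplified stand-in: `LineDeserializer`, `TwoColumnDeserializer`, and the "delimiter" property key are all hypothetical and are not part of Hive; the stand-in takes a String instead of a Writable and returns field names instead of an ObjectInspector so that it runs without Hive jars.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

final class DeserializerContractSketch {

    /** Hypothetical stand-in for the three-method contract; NOT the Hive interface. */
    interface LineDeserializer {
        void initialize(Properties tableProperties); // called once per table
        List<Object> deserialize(String line);       // called once per record
        List<String> fieldNames();                   // stands in for getObjectInspector()
    }

    /** Parses "time<delimiter>userid" and reuses a single row object across calls. */
    static final class TwoColumnDeserializer implements LineDeserializer {
        private final List<Object> row = new ArrayList<>(); // reused on every call
        private String delimiterRegex = Pattern.quote("\t");

        @Override
        public void initialize(Properties tableProperties) {
            // "delimiter" is a made-up property key for this sketch.
            delimiterRegex = Pattern.quote(tableProperties.getProperty("delimiter", "\t"));
        }

        @Override
        public List<Object> deserialize(String line) {
            String[] parts = line.split(delimiterRegex);
            row.clear();
            row.add(Long.valueOf(parts[0]));
            row.add(Integer.valueOf(parts[1]));
            return row;
        }

        @Override
        public List<String> fieldNames() {
            return Arrays.asList("time", "userid");
        }
    }

    /** Collects every row; with copy == false all entries alias the reused row. */
    static List<List<Object>> readAll(LineDeserializer d, List<String> lines, boolean copy) {
        List<List<Object>> rows = new ArrayList<>();
        for (String line : lines) {
            List<Object> row = d.deserialize(line);
            rows.add(copy ? new ArrayList<>(row) : row);
        }
        return rows;
    }

    /** Runs the three steps in order on two comma-delimited records. */
    static List<List<Object>> demo(boolean copy) {
        TwoColumnDeserializer d = new TwoColumnDeserializer();
        Properties props = new Properties();
        props.setProperty("delimiter", ",");
        d.initialize(props);
        return readAll(d, Arrays.asList("1,10", "2,20"), copy);
    }

    public static void main(String[] args) {
        System.out.println(demo(true));  // prints [[1, 10], [2, 20]]
        System.out.println(demo(false)); // prints [[2, 20], [2, 20]]
    }
}
```

Without the copy, both list entries point at the one reused row, so only the last record survives. In Hive the copy would be made with ObjectInspectorUtils.getStandardObject(), as the Javadoc says.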

The following deserializer class splits one line of data into the four Hive table columns time, userid, host, and path:

package com.sina.hive.test;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/**
 * Splits a tab-separated line "time, userid, url" into the four columns
 * time, userid, host, and path.
 */
public class TestDeserializer implements Deserializer {
	// Column names and their inspectors, kept in the same order.
	private static final List<String> FieldNames = new ArrayList<String>();
	private static final List<ObjectInspector> FieldNamesObjectInspectors = new ArrayList<ObjectInspector>();
	static {
		FieldNames.add("time");
		FieldNamesObjectInspectors.add(ObjectInspectorFactory
				.getReflectionObjectInspector(Long.class,
						ObjectInspectorOptions.JAVA));
		FieldNames.add("userid");
		FieldNamesObjectInspectors.add(ObjectInspectorFactory
				.getReflectionObjectInspector(Integer.class,
						ObjectInspectorOptions.JAVA));
		FieldNames.add("host");
		FieldNamesObjectInspectors.add(ObjectInspectorFactory
				.getReflectionObjectInspector(String.class,
						ObjectInspectorOptions.JAVA));
		FieldNames.add("path");
		FieldNamesObjectInspectors.add(ObjectInspectorFactory
				.getReflectionObjectInspector(String.class,
						ObjectInspectorOptions.JAVA));
	}

	/** Returns the row as a List of four values, or null if the line is malformed. */
	@Override
	public Object deserialize(Writable blob) {
		try {
			if (blob instanceof Text) {
				String line = ((Text) blob).toString();
				// Fields are separated by a tab character ("\t", not "/t").
				String[] field = line.split("\t");
				if (field.length != 3) {
					return null;
				}
				List<Object> result = new ArrayList<Object>();
				URL url = new URL(field[2]);
				Long time = Long.valueOf(field[0]);
				Integer userid = Integer.valueOf(field[1]);
				result.add(time);
				result.add(userid);
				result.add(url.getHost());
				result.add(url.getPath());
				return result;
			}
		} catch (MalformedURLException e) {
			e.printStackTrace();
		} catch (NumberFormatException e) {
			// A non-numeric time or userid makes the row unreadable; skip it.
			e.printStackTrace();
		}
		return null;
	}

	/** Describes the row as a struct whose fields match FieldNames. */
	@Override
	public ObjectInspector getObjectInspector() throws SerDeException {
		return ObjectInspectorFactory.getStandardStructObjectInspector(
				FieldNames, FieldNamesObjectInspectors);
	}

	/** No configuration is needed for this fixed four-column layout. */
	@Override
	public void initialize(Configuration arg0, Properties arg1)
			throws SerDeException {
	}
}


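Test the deserializer against Hive table data on HDFS. One sample record looks like this (the three fields are separated by tab characters):

1234567891012	123456	http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF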

hive> add jar /run/jar/merg_hua.jar;
Added /run/jar/merg_hua.jar to class path
hive> create table serde_table row format serde 'hive.connect.TestDeserializer';
Found class for hive.connect.TestDeserializer
OK
Time taken: 0.028 seconds
hive> describe serde_table;
OK
time	bigint	from deserializer
userid	int	from deserializer
host	string	from deserializer
path	string	from deserializer
Time taken: 0.042 seconds
hive> select * from serde_table;
OK
1234567891012	123456	wiki.apache.org	/hadoop/Hive/LanguageManual/UDF
Time taken: 0.039 seconds
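
The row returned by the select above can be reproduced outside Hive. The following is a minimal sketch, assuming the sample record's fields are tab-separated; `SampleRowCheck` is a hypothetical helper, and it uses java.net.URI in place of the java.net.URL used in the deserializer so that it compiles without deprecation warnings on recent JDKs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

/** Standalone check of the field-splitting logic; no Hive or Hadoop classes involved. */
final class SampleRowCheck {

    /** Returns [time, userid, host, path], or null for a malformed line. */
    static List<Object> parse(String line) {
        String[] field = line.split("\t");
        if (field.length != 3) {
            return null;
        }
        try {
            URI uri = new URI(field[2]);
            List<Object> row = new ArrayList<>();
            row.add(Long.valueOf(field[0]));
            row.add(Integer.valueOf(field[1]));
            row.add(uri.getHost());
            row.add(uri.getPath());
            return row;
        } catch (URISyntaxException | NumberFormatException e) {
            // Unparseable URL or number: treat the whole record as unreadable.
            return null;
        }
    }

    public static void main(String[] args) {
        String sample = "1234567891012\t123456\thttp://wiki.apache.org/hadoop/Hive/LanguageManual/UDF";
        // prints [1234567891012, 123456, wiki.apache.org, /hadoop/Hive/LanguageManual/UDF]
        System.out.println(parse(sample));
        // A line without tab separators does not yield three fields.
        System.out.println(parse("1234567891012 123456")); // prints null
    }
}
```

A record that fails the field-count check comes back as null, which is the same way the deserializer signals a malformed line.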

III. Summary

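1. To create a Hive table backed by a custom SerDe, write a class that implements Deserializer and name it in the ROW FORMAT clause of the create table command.

2. When processing massive data sets, if the data format matches the table structure, Hive's deserialization can read the data directly with no conversion step, which saves a great deal of time.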
