I. Background
1. When processes communicate remotely, they can send each other data of many types, but whatever the type, the data travels over the network as a binary byte sequence. The sender must convert its objects into a byte sequence before transmission; this is called object serialization. The receiver must restore the byte sequence back into objects; this is called object deserialization. (A minimal Java sketch of this round trip follows this list.)
2. Hive's deserialization turns each key/value record into the per-column values of a row in a Hive table.
3. Hive can load data into a table without transforming it first, which saves a great deal of time when processing massive datasets.
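To illustrate the round trip described in point 1, here is a minimal, self-contained Java sketch; the Point class and SerializeDemo are invented for this illustration and are not part of Hive:

import java.io.*;

// A throwaway Serializable type, used only for this demo.
class Point implements Serializable {
    int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

public class SerializeDemo {
    public static void main(String[] args) throws Exception {
        Point p = new Point(3, 4);

        // Serialization: object -> byte sequence.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new ObjectOutputStream(bytes).writeObject(p);

        // Deserialization: byte sequence -> object.
        ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        Point copy = (Point) in.readObject();
        System.out.println(copy.x + "," + copy.y); // prints 3,4
    }
}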
II. Technical Details
1. SerDe is short for Serialize/Deserialize; its purpose is serialization and deserialization.
2. When creating a table, a user can supply a custom SerDe or use one of Hive's built-in SerDes. The SerDe specifies the table's columns and maps the data onto those columns. The relevant CREATE TABLE syntax is:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

To create a table with a specific SerDe, use the ROW FORMAT row_format clause. For example:
a) Add the jar. In the Hive CLI:
hive> add jar /run/serde_test.jar;
or from the Linux shell:
${HIVE_HOME}/bin/hive -auxpath /run/serde_test.jar
b) Create the table:
create table serde_table row format serde 'hive.connect.TestDeserializer';
3. Write the deserialization class TestDeserializer, implementing the three methods of the Deserializer interface:
a) Initialization: initialize(Configuration conf, Properties tbl).
b) Deserialize a Writable and return an Object: deserialize(Writable blob).
c) Get the ObjectInspector for the Object returned by deserialize(Writable blob): getObjectInspector().
The interface is defined as follows:
package com.sina.hive.test;

import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.io.Writable;

public interface Deserializer {
    /**
     * Initialize the Hive Deserializer.
     * @param conf System properties
     * @param tbl table properties
     * @throws SerDeException
     */
    public void initialize(Configuration conf, Properties tbl)
            throws SerDeException;

    /**
     * Deserialize an object out of a Writable blob. In most cases, the
     * return value of this function will be constant since the function
     * will reuse the returned object. If the client wants to keep a copy
     * of the object, the client needs to clone the returned value by
     * calling ObjectInspectorUtils.getStandardObject().
     * @param blob The Writable object containing a serialized object
     * @return A Java object representing the contents in the blob.
     */
    public Object deserialize(Writable blob) throws SerDeException;

    /**
     * Get the object inspector that can be used to navigate through the
     * internal structure of the Object returned from deserialize(...).
     */
    public ObjectInspector getObjectInspector() throws SerDeException;
}
The following class deserializes one line of data into the four Hive table columns time, userid, host, and path. For example:
package com.sina.hive.test;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TestDeserializer implements Deserializer {
    private static List<String> FieldNames = new ArrayList<String>();
    private static List<ObjectInspector> FieldNamesObjectInspectors =
            new ArrayList<ObjectInspector>();

    // Register the four output columns and an ObjectInspector for each.
    static {
        FieldNames.add("time");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(Long.class,
                        ObjectInspectorOptions.JAVA));
        FieldNames.add("userid");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(Integer.class,
                        ObjectInspectorOptions.JAVA));
        FieldNames.add("host");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(String.class,
                        ObjectInspectorOptions.JAVA));
        FieldNames.add("path");
        FieldNamesObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(String.class,
                        ObjectInspectorOptions.JAVA));
    }

    @Override
    public Object deserialize(Writable blob) {
        try {
            if (blob instanceof Text) {
                String line = ((Text) blob).toString();
                if (line == null)
                    return null;
                // Each input row is "time\tuserid\turl"; reject rows that
                // do not have exactly three tab-separated fields.
                String[] field = line.split("\t");
                if (field.length != 3) {
                    return null;
                }
                List<Object> result = new ArrayList<Object>();
                // Split the URL into its host and path components.
                URL url = new URL(field[2]);
                Long time = Long.valueOf(field[0]);
                Integer userid = Integer.valueOf(field[1]);
                result.add(time);
                result.add(userid);
                result.add(url.getHost());
                result.add(url.getPath());
                return result;
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        return null;
    }

    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        return ObjectInspectorFactory.getStandardStructObjectInspector(
                FieldNames, FieldNamesObjectInspectors);
    }

    @Override
    public void initialize(Configuration arg0, Properties arg1)
            throws SerDeException {
    }
}
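To sanity-check the class outside of Hive, you can call deserialize on a Text row directly. The driver below is a minimal sketch; the class name TestDeserializerDriver is invented here, and it assumes the Hadoop and Hive jars are on the classpath:

package com.sina.hive.test;

import java.util.List;
import org.apache.hadoop.io.Text;

public class TestDeserializerDriver {
    public static void main(String[] args) throws Exception {
        TestDeserializer serde = new TestDeserializer();
        serde.initialize(null, null); // no-op in this implementation

        // One tab-separated input row: time, userid, url.
        Text row = new Text(
            "1234567891012\t123456\thttp://wiki.apache.org/hadoop/Hive/LanguageManual/UDF");
        List<?> fields = (List<?>) serde.deserialize(row);

        // Expected: [1234567891012, 123456, wiki.apache.org, /hadoop/Hive/LanguageManual/UDF]
        System.out.println(fields);
    }
}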
Test against the Hive table data on HDFS. Here is one line of test data (the three fields are separated by tabs):
1234567891012    123456    http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF
hive> add jar /run/jar/merg_hua.jar;
Added /run/jar/merg_hua.jar to class path
hive> create table serde_table row format serde 'hive.connect.TestDeserializer';
Found class for hive.connect.TestDeserializer
OK
Time taken: 0.028 seconds
hive> describe serde_table;
OK
time      bigint    from deserializer
userid    int       from deserializer
host      string    from deserializer
path      string    from deserializer
Time taken: 0.042 seconds
hive> select * from serde_table;
OK
1234567891012    123456    wiki.apache.org    /hadoop/Hive/LanguageManual/UDF
Time taken: 0.039 seconds

III. Summary
1. To use custom deserialization for a Hive table, write your own class implementing the Deserializer interface and select it through the ROW FORMAT clause of the CREATE TABLE command.
2. When processing massive datasets whose format already matches the table schema, Hive's deserialization lets you load the data without transforming it first, saving a great deal of time.