Hive Study Notes: Delimiter Handling


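This article describes how to use a multi-character string as the field delimiter in Hive. It covers writing custom InputFormat and OutputFormat classes, as well as a SerDe-based approach that formats the data during serialization and deserialization.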


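By default, Hive supports only single-character field delimiters, and the default delimiter is \001 (Ctrl-A). You can also specify a different delimiter when creating a table, for example: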

create table user(name string, password string) row format delimited fields terminated by '\t';

This sets the field delimiter for the table.
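
To see why a single-character delimiter falls short for data separated by a multi-character token such as `::`, the sketch below splits a line on one character in plain Java. It only imitates the splitting behaviour and calls no Hive code; the class name `SingleCharSplit` and its `split` method are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hive: splits a line on ONE delimiter
// character, which is roughly what a single-character field terminator does.
final class SingleCharSplit {

    static List<String> split(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == delimiter) {
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(line.substring(start)); // last field (may be empty)
        return fields;
    }

    public static void main(String[] args) {
        // Tab-separated data: two fields, as intended.
        System.out.println(split("alice\tsecret", '\t'));
        // "::"-separated data split on ':' yields a spurious empty field.
        System.out.println(split("alice::secret", ':'));
    }
}
```

With the `::` input, the split returns three fields instead of two, because the second colon is treated as another delimiter with an empty value in between.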

To support a multi-character delimiter, you can use either of the following approaches:

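1. Write a custom InputFormat and override the next method of its RecordReader. This handles the input side. A different delimiter can also be set on the output side in the same way, with a custom OutputFormat. Note that the custom OutputFormat must implement the HiveOutputFormat interface and override the write method of its RecordWriter; the HiveIgnoreKeyTextOutputFormat class is a useful reference.

Hive's InputFormat/OutputFormat are very similar to Hadoop's InputFormat/OutputFormat: the InputFormat formats or converts the input data before handing it to Hive, and the OutputFormat reformats Hive's output into the target format before writing it to a file. Customizing the format this way works at a fairly low level. Once the classes are written, package them into a jar and place it in the lib folder under the Hive directory.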


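// next() of the custom RecordReader: reads one line and rewrites the
// on-disk "::" delimiter to the single ':' declared for the table.
public synchronized boolean next(LongWritable key, Text value)
        throws IOException {
    while (pos < end) {
        key.set(pos);
        int newSize = lineReader.readLine(value, maxLineLength,
                Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
        if (newSize == 0) return false; // end of input
        // Note: toLowerCase() also lowercases the data itself.
        String str = value.toString().toLowerCase().replaceAll("::", ":");
        value.set(str);
        pos += newSize;
        if (newSize < maxLineLength) return true; // otherwise skip the over-long line
    }
    return false;
}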


// write() of the custom RecordWriter: expands ':' back to "::" on output.
@Override
public void write(Writable r) throws IOException {
    if (r instanceof Text) {
        Text tr = (Text) r;
        String strReplace = tr.toString().toLowerCase().replace(":", "::");
        Text txtReplace = new Text();
        txtReplace.set(strReplace);
        outStream.write(txtReplace.getBytes(), 0, txtReplace.getLength());
        outStream.write(finalRowSeparator);
    } else {
        // Binary rows are written through unchanged.
        BytesWritable bw = (BytesWritable) r;
        outStream.write(bw.getBytes(), 0, bw.getLength());
        outStream.write(finalRowSeparator);
    }
}
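
The string transformation that the reader and writer apply can be tried out without Hadoop or Hive on the classpath. The following is a minimal sketch under that assumption: `DelimiterRoundTrip`, `onRead`, and `onWrite` are hypothetical names that only mirror the replace logic of the two methods, not the actual RecordReader or RecordWriter.

```java
// Hypothetical stand-in for the replace logic of next() and write();
// no Hadoop or Hive classes are involved.
final class DelimiterRoundTrip {

    // Mirrors next(): collapse the on-disk "::" into the ':' declared for the table.
    static String onRead(String rawLine) {
        return rawLine.toLowerCase().replace("::", ":");
    }

    // Mirrors write(): expand ':' back into "::" before the row is written out.
    static String onWrite(String row) {
        return row.toLowerCase().replace(":", "::");
    }

    public static void main(String[] args) {
        String raw = "Alice::Secret";
        String row = onRead(raw);
        System.out.println(row);          // alice:secret
        System.out.println(onWrite(row)); // alice::secret
    }
}
```

Two limitations of this scheme show up here. The lowercasing changes the data, so the round trip does not restore `Alice::Secret`. A field that itself contains a single `:` is also read as an extra column and written back with `::` in its place.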

Restart the Hive shell so that the new jar is picked up, then create the table as follows:


create table user(username string,password string)
row format delimited
fields terminated by ':'
stored as
INPUTFORMAT 'org.platform.utils.bigdata.hive.CustomInputFormat'
OUTPUTFORMAT 'org.platform.utils.bigdata.hive.CustomOutputFormat';

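2. Use a SerDe (serializer/deserializer) to format the data during serialization and deserialization. This approach is somewhat more complex and gives less control over the data. It relies on regular expressions to match and process the data, which also costs some performance. Its advantage is that you can define custom table properties through SERDEPROPERTIES, and the SerDe can read these properties to enable further customization. Example: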
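
As an aside, the regex matching that such a SerDe relies on can be sketched in plain Java. This is only an illustration under stated assumptions: `RegexRowParser` and its pattern are hypothetical, the code does not use Hive's SerDe API, and it assumes two columns separated by `::`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of regex-driven row parsing; not Hive's SerDe API.
final class RegexRowParser {

    // One capture group per column; the literal "::" between them is the delimiter.
    // In a SerDe, a pattern like this would typically be supplied via SERDEPROPERTIES.
    private static final Pattern ROW = Pattern.compile("^(.*?)::(.*)$");

    // Returns the column values, or null when the line does not match the pattern.
    static List<String> parse(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> columns = new ArrayList<>();
        for (int g = 1; g <= m.groupCount(); g++) {
            columns.add(m.group(g));
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(parse("alice::secret")); // [alice, secret]
        System.out.println(parse("no-delimiter"));  // null
    }
}
```

Each row costs a full regex match, which is where the performance overhead mentioned above comes from.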
