Fayson的github: https://github.com/fayson/cdhproject
推荐关注微信公众号:“Hadoop实操”,ID:gh_c4c535955d0f,或者扫描文末二维码。
1 问题描述
如何将多个字符作为字段分割符的数据文件加载到Hive表中,事例数据如下:
字段分隔符为“@#$”
test1@#$test1name@#$test2value
test2@#$test2name@#$test2value
test3@#$test3name@#$test4value
如何将上述事例数据加载到Hive表(multi_delimiter_test)中,表结构如下:
字段名 | 字段类型 |
---|---|
s1 | String |
s2 | String |
s3 | String |
2 Hive多分隔符支持
Hive在0.14及以后版本支持字段的多分隔符,参考:MultiDelimitSerDe
3 实现方式
- 测试环境说明
测试环境为CDH5.11.1
Hive版本为1.1.0
操作系统为RedHat6.5
- 操作步骤
1.准备多分隔符文件并装载到HDFS对应目录
[ec2-user@ip-172-31-8-141 ~]$ cat multi_delimiter_test.dat
test1@#$test1name@#$test2value
test2@#$test2name@#$test2value
test3@#$test3name@#$test4value
[ec2-user@ip-172-31-8-141 ~]$ hadoop dfs -put multi_delimiter_test.dat /fayson/multi_delimiter_test
[ec2-user@ip-172-31-8-141 ~]$ hadoop dfs -ls /fayson/multi_delimiter_test
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 1 items
-rw-r--r-- 3 user_r supergroup 93 2017-08-23 03:24 /fayson/multi_delimiter_test/multi_delimiter_test.dat
[ec2-user@ip-172-31-8-141 ~]$
2.基于准备好的多分隔符文件建表
create external table multi_delimiter_test(
s1 string,
s2 string,
s3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH SERDEP