Putting files directly into HDFS, then creating a Hive table to query the data

License: CC 4.0 BY-SA

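This article shows how to load local files into HDFS and query them through a Hive table. It first creates an empty Hive table with a specified storage format, then uploads the files under the table's HDFS path, and finally adds a table partition so that the data can be queried.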

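Business needs sometimes require importing file data from elsewhere into HDFS and then creating a Hive table to query it or compute statistics over it. This post demonstrates how to put local files under an existing empty Hive table and then query the data.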

1-Create an empty table

CREATE TABLE `hive_test.direct_load_file_into_table`(
  `id` int,
  `name` string)
PARTITIONED BY (`year` string COMMENT '年', `month` string COMMENT '月', `day` string COMMENT '日')
row format delimited fields terminated by '\t'
STORED AS textfile;

Then run show create table to confirm the table's HDFS path, for example /user/hive/warehouse/hive_test.db/direct_load_file_into_table

CREATE TABLE `direct_load_file_into_table`(
  `id` int,
  `name` string)
PARTITIONED BY (
  `year` string COMMENT '年',
  `month` string COMMENT '月',
  `day` string COMMENT '日')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://xxxxxx:9000/user/hive/warehouse/hive_test.db/direct_load_file_into_table'
TBLPROPERTIES (
  'transient_lastDdlTime'='1523536404')
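
When this step is scripted, the path can be pulled out of the show create table output instead of being copied by hand. The following is a minimal sketch, assuming the output has the layout shown above (a LOCATION line followed by a single-quoted URI). The function names extract_location and strip_authority are made up for this example.

```shell
# Read SHOW CREATE TABLE output on stdin and print the value on the line after LOCATION,
# with the surrounding single quotes and whitespace removed.
extract_location() {
  awk -v q="'" '
    found { gsub("^[ \t]*" q "|" q "[ \t]*$", ""); print; exit }
    /^[ \t]*LOCATION[ \t]*$/ { found = 1 }
  '
}

# Drop the scheme and host:port (for example hdfs://xxxxxx:9000) so only the path remains.
strip_authority() {
  sed -E 's#^[A-Za-z][A-Za-z0-9+.-]*://[^/]*##'
}

# Usage (requires the hive CLI):
#   hive -e 'show create table hive_test.direct_load_file_into_table' | extract_location | strip_authority
```

The resulting path is the directory under which the files are put in the next step.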

2-Put files under that path

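Here two local files are put under the HDFS path of the table created in the previous step. To also demonstrate nested directories, an extra directory level, 01 and 02, is added below the partition directory 2018/04/01.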

hadoop fs -mkdir -p  /user/hive/warehouse/hive_test.db/direct_load_file_into_table/2018/04/01/01
hadoop fs -mkdir -p  /user/hive/warehouse/hive_test.db/direct_load_file_into_table/2018/04/01/02

hadoop fs -ls /user/hive/warehouse/hive_test.db/direct_load_file_into_table/2018/04/01
drwxr-xr-x   - hjw01 supergroup          0 2018-04-12 20:12 /user/hive/warehouse/hive_test.db/direct_load_file_into_table/2018/04/01/01
drwxr-xr-x   - hjw01 supergroup          0 2018-04-12 20:12 /user/hive/warehouse/hive_test.db/direct_load_file_into_table/2018/04/01/02

The local files look like this

[hjw01@f0d97899be95 data]$ more 01.txt
101	张三
102	李四
103	王五
[hjw01@f0d97899be95 data]$ more 02.txt
201	张三
202	李四
203	王五

Put the local files into HDFS

hadoop fs -put  01.txt /user/hive/warehouse/hive_test.db/direct_load_file_into_table/2018/04/01/01
hadoop fs -put  02.txt /user/hive/warehouse/hive_test.db/direct_load_file_into_table/2018/04/01/02

3-Add the table partition

Adding a partition that points at the uploaded directory makes the data under it available for queries

alter table hive_test.direct_load_file_into_table add if not exists   partition(year='2018',month='04',day='01') location '2018/04/01';
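
For loading many days, the statement can be generated per date. The sketch below is one possible variant, not the exact statement used above: it writes the partition location as an absolute path built from the table directory, which is a choice made for this example. add_partition_sql and TABLE_DIR are hypothetical names.

```shell
# Table directory confirmed with show create table in step 1.
TABLE_DIR=/user/hive/warehouse/hive_test.db/direct_load_file_into_table

# add_partition_sql YEAR MONTH DAY
# Prints an alter table statement that registers the partition for that day,
# using an absolute location under TABLE_DIR (an assumption of this sketch).
add_partition_sql() {
  printf "alter table hive_test.direct_load_file_into_table add if not exists partition(year='%s',month='%s',day='%s') location '%s/%s/%s/%s';\n" \
    "$1" "$2" "$3" "$TABLE_DIR" "$1" "$2" "$3"
}

# Usage (requires the hive CLI):
#   hive -e "$(add_partition_sql 2018 04 01)"
#   hive -e 'show partitions hive_test.direct_load_file_into_table'
```

The show partitions call in the usage comment is a quick way to confirm that the partition was registered.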

4-Query to verify the data

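Because the files sit in nested subdirectories below the partition directory, two parameters must be set before querying, as follows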

set hive.mapred.supports.subdirectories=true;
set mapreduce.input.fileinputformat.input.dir.recursive=true;
select 
* 
from
hive_test.direct_load_file_into_table
where
concat(year,month,day) = '20180401';
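
The same query can also be run non-interactively. The sketch below prints the two settings followed by the query, so the whole script can be handed to hive -e. It filters on the three partition columns individually rather than through concat, which is a stylistic choice of this example and selects the same partition. day_query is a hypothetical helper name.

```shell
# day_query YEAR MONTH DAY
# Prints the HiveQL for one day's partition, with the recursive-read settings first.
day_query() {
  cat <<EOF
set hive.mapred.supports.subdirectories=true;
set mapreduce.input.fileinputformat.input.dir.recursive=true;
select * from hive_test.direct_load_file_into_table
where year = '$1' and month = '$2' and day = '$3';
EOF
}

# Usage (requires the hive CLI):
#   hive -e "$(day_query 2018 04 01)"
```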

Query result

hive> set hive.mapred.supports.subdirectories=true;
hive> set mapreduce.input.fileinputformat.input.dir.recursive=true;
hive> select
    > *
    > from
    > hive_test.direct_load_file_into_table
    > where
    > concat(year,month,day) = '20180401';
OK
101	张三	2018	04	01
102	李四	2018	04	01
103	王五	2018	04	01
201	张三	2018	04	01
202	李四	2018	04	01
203	王五	2018	04	01
Time taken: 1.775 seconds, Fetched: 6 row(s)
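
As a sanity check, the number of fetched rows should equal the total number of lines in the local files that were uploaded (3 + 3 = 6 here). A minimal sketch for computing that expected count locally follows. expected_rows is a name made up for this example, and it assumes one record per line, as in 01.txt and 02.txt.

```shell
# expected_rows FILE...
# Prints the total number of lines across the given local files.
expected_rows() {
  awk 'END { print NR }' "$@"
}

# Usage, run in the directory that holds the local files:
#   expected_rows 01.txt 02.txt
```

If the count differs from the "Fetched" figure, a likely cause is that the recursive-read settings from step 4 were not set, so the files in the nested directories were not read.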