Lesson 10: Hive Installation and Usage Tutorial


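This article walks through installing and using Hive in detail, covering Hive's basic concepts, the installation process, metastore service configuration, and the creation and use of different table types (internal, partitioned, bucketed, and external tables), with concrete examples.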


Statement

This article is based on CentOS 6.x + CDH 5.x.

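What Is Hive

Hive provides a way to query data with SQL, so you can write SQL statements on Hadoop. It is best not to use Hive for real-time queries, though. Hive works by translating SQL statements into multiple MapReduce jobs, so it is very slow; the official documentation says Hive suits high-latency scenarios, and it is also resource-hungry.

As a simple example, you can query like this:
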
hive> select * from h_employee;  
OK  
1   1   peter  
2   2   paul  
Time taken: 9.289 seconds, Fetched: 2 row(s)  

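This h_employee is not necessarily a database table; it may be nothing more than a metadata mapping over a CSV file.

Installing Hive

Unlike many tutorials that introduce concepts first, I prefer to get things installed first and then introduce the concepts through examples. Let's install Hive.

First confirm that the corresponding yum repository is installed. If it is not, install the CDH yum repository as described in this tutorial: http://blog.youkuaiyun.com/nsrainbow/article/details/36629339

Hive base package
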
yum install hive -y  

hive metastore

yum install hive-metastore  

Hive server

yum install hive-server2 -y  

If you need to communicate with HBase, install hive-hbase

yum install hive-hbase -y  

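Hive metastore service

Three modes

The Hive metastore service stores the metadata for Hive tables and partitions. The metastore concept is explained later; for now let's get the installation done. Hive can store the metastore in three modes.

Embedded mode

This mode uses Derby as the database. Derby is a pure-Java database and is very limited here: it allows only one session at a time, so it is really only good for testing. So let's move on to the second mode.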

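Local mode

In this mode, the Hive metastore service runs in the same process as HiveServer, but the metastore database runs in a separate process, possibly on another machine. The embedded metastore service talks to the metastore database over JDBC. This is a step up from the previous option, but still not good enough, because the Hive metastore and HiveServer still share one process. That brings us to the third mode, which CDH strongly recommends.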

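Remote mode

In this mode, the Hive metastore service runs in its own JVM process. HiveServer2, HCatalog, Cloudera Impala™, and other processes communicate with it through the Thrift network API (configured with the hive.metastore.uris property). The metastore service communicates with the metastore database over JDBC (configured with the javax.jdo.option.ConnectionURL property). The database, the HiveServer process, and the metastore service can all run on the same host, but running the HiveServer process on a separate host gives better availability (don't put all your eggs in one basket) and scalability.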

Using MySQL as the metastore database

We choose MySQL as the metastore database.

Install MySQL

If MySQL is already installed on your machine, you can skip this step.

yum install mysql-server  

Start the service

service mysqld start  

Enable start on boot

chkconfig mysqld on  

Initialize some MySQL settings, such as the root user's password

$ sudo /usr/bin/mysql_secure_installation  
[...]  
Enter current password for root (enter for none):  
OK, successfully used password, moving on...  
[...]  
Set root password? [Y/n] y  
New password:  
Re-enter new password:  
Remove anonymous users? [Y/n] Y  
[...]  
Disallow root login remotely? [Y/n] N  
[...]  
Remove test database and access to it [Y/n] Y  
[...]  
Reload privilege tables now? [Y/n] Y  
All done!  

Install the MySQL JDBC driver

$ sudo yum install mysql-connector-java
$ sudo ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar

The second command creates a symlink to the driver in Hive's lib directory so that Hive can load it.

Create the user and database the metastore needs

Create the metastore database

$ mysql -u root -p  
Enter password:  
mysql> CREATE DATABASE metastore;  
mysql> USE metastore;  
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.13.0.mysql.sql;  

Create the hive user

The example given in the official documentation is

mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';  
...  
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'metastorehost';  
mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'metastorehost';  
mysql> FLUSH PRIVILEGES;  

Here, replace metastorehost with the hostname of your metastore machine and mypassword with the password you want to set.
In this example it looks like this:

mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';  
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'%';  
mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%';  
mysql> FLUSH PRIVILEGES;  
mysql> quit;  

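Configure Hive

Edit /usr/lib/hive/conf/hive-site.xml

Suppose the machine where you installed MySQL is named host1; put the JDBC connection string in javax.jdo.option.ConnectionURL.
hive.metastore.uris: in my setup this parameter had to use an IP address; I'm not sure why.
hive.metastore.schema.verification: the official recommendation is true. According to the documentation, the metastore schema differs a lot between old and new Hive versions, so verification should be enabled to avoid errors.
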
<?xml version="1.0"?>  
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>  
<configuration>  
  
  <property>  
    <name>javax.jdo.option.ConnectionURL</name>  
    <value>jdbc:mysql://host1/metastore</value>  
    <description>the URL of the MySQL database</description>  
  </property>  
  <property>  
    <name>javax.jdo.option.ConnectionDriverName</name>  
    <value>com.mysql.jdbc.Driver</value>  
  </property>  
  <property>  
    <name>javax.jdo.option.ConnectionUserName</name>  
    <value>hive</value>  
  </property>  
  <property>  
    <name>javax.jdo.option.ConnectionPassword</name>  
    <value>hive</value>  
  </property>  
  <property>  
    <name>datanucleus.autoCreateSchema</name>  
    <value>false</value>  
  </property>  
  <property>  
    <name>datanucleus.fixedDatastore</name>  
    <value>true</value>  
  </property>  
  <property>  
    <name>datanucleus.autoStartMechanism</name>   
    <value>SchemaTable</value>  
  </property>   
  <property>  
    <name>hive.metastore.uris</name>  
    <value>thrift://192.168.199.126:9083</value>  
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>  
  </property>  
  <property>  
    <name>hive.metastore.schema.verification</name>  
    <value>true</value>  
  </property>  
</configuration>  
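
To double-check what was actually written to hive-site.xml, a small shell helper can print the value of a single property. This is only a sketch: get_hive_prop is a hypothetical helper name, and it assumes each <name> and <value> element sits on its own line, as in the file above.

```shell
# Hypothetical helper: print the <value> of one property in a hive-site.xml.
# $1 = path to the file, $2 = property name.
# Assumes <name> and <value> each sit on a single line, as in the file above.
get_hive_prop() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1; next }
    found && match($0, /<value>[^<]*<\/value>/) {
      # Strip the 7-character <value> and the 8-character </value>.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# Example (path as used in this article):
#   get_hive_prop /usr/lib/hive/conf/hive-site.xml hive.metastore.uris
```

A <description> line between <name> and <value> is skipped, because the helper keeps scanning until it reaches the next <value> line.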

Configure HiveServer2

Edit /etc/hive/conf/hive-site.xml and add or modify these two properties

<property>  
  <name>hive.support.concurrency</name>  
  <description>Enable Hive's Table Lock Manager Service</description>  
  <value>true</value>  
</property>  
  
<property>  
  <name>hive.zookeeper.quorum</name>  
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>  
  <value>host1,host2</value>  
</property>  

If you changed ZooKeeper's default port, add or modify this property

<property>  
  <name>hive.zookeeper.client.port</name>  
  <value>2222</value>  
  <description>  
  The port at which the clients will connect.  
  </description>  
</property>  

Start the services

The startup order is hive-metastore -> hive-server2

service hive-metastore start  
service hive-server2 start  
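
Because the order matters, one option is to wait until the metastore's Thrift port accepts connections before starting HiveServer2. The sketch below is an assumption-laden illustration, not part of CDH: parse_thrift_uri and port_open are hypothetical helpers, and the probe relies on bash's /dev/tcp pseudo-device and the coreutils timeout command.

```shell
# Hypothetical helpers; the URI format follows hive.metastore.uris.
# Split thrift://host:port into "host port".
parse_thrift_uri() {
  hostport=${1#thrift://}
  echo "${hostport%%:*} ${hostport##*:}"
}

# Return 0 if a TCP connection to host ($1) and port ($2) succeeds within 2 seconds.
# Uses bash's /dev/tcp pseudo-device and coreutils timeout.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Usage sketch (not run here):
#   set -- $(parse_thrift_uri thrift://192.168.199.126:9083)
#   service hive-metastore start
#   until port_open "$1" "$2"; do sleep 1; done
#   service hive-server2 start
```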

Problems encountered at startup

I ran into a problem: startup failed with this error

Starting Hive Metastore Server  
Error creating temp dir in hadoop.tmp.dir /data/hdfs/tmp due to Permission denied  

Granting write permission on that tmp directory (/data/hdfs/tmp in the error above) fixes it

cd /data/hdfs  
chmod a+rwx tmp  

Test whether the installation succeeded

Run hive to enter the client

$ hive  
hive>  
hive> show tables;  
OK  
Time taken: 10.345 seconds  

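Using Hive

metastore

Tables created in Hive are called metastore tables. These tables do not store the real data themselves; they define the mapping between the real data and Hive, much like table meta information in a traditional database, hence the name metastore. For the actual storage there are four modes you can define:

Internal table (default)
Partitioned table
Bucketed table
External table

For example, this is a statement that creates an internal table
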
CREATE TABLE worker(id INT, name STRING)  
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';  

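This statement creates an internal table named worker. Internal is the default type, so no storage mode needs to be written. Fields are stored with a comma as the delimiter.

Types supported in CREATE TABLE

Primitive data types
tinyint / smallint / int / bigint
float / double
boolean
string

Complex data types
Array / Map / Struct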

There is no datetime type. (Newer Hive releases do provide TIMESTAMP and DATE.)

Where do created tables live?

Under /user/hive/warehouse. You can use hdfs to see where the created tables are stored

$ hdfs dfs -ls /user/hive/warehouse  
Found 11 items  
drwxrwxrwt   - root     supergroup          0 2014-12-02 14:42 /user/hive/warehouse/h_employee  
drwxrwxrwt   - root     supergroup          0 2014-12-02 14:42 /user/hive/warehouse/h_employee2  
drwxrwxrwt   - wlsuser  supergroup          0 2014-12-04 17:21 /user/hive/warehouse/h_employee_export  
drwxrwxrwt   - root     supergroup          0 2014-08-18 09:20 /user/hive/warehouse/h_http_access_logs  
drwxrwxrwt   - root     supergroup          0 2014-06-30 10:15 /user/hive/warehouse/hbase_apache_access_log  
drwxrwxrwt   - username supergroup          0 2014-06-27 17:48 /user/hive/warehouse/hbase_table_1  
drwxrwxrwt   - username supergroup          0 2014-06-30 09:21 /user/hive/warehouse/hbase_table_2  
drwxrwxrwt   - username supergroup          0 2014-06-30 09:43 /user/hive/warehouse/hive_apache_accesslog  
drwxrwxrwt   - root     supergroup          0 2014-12-02 15:12 /user/hive/warehouse/hive_employee  

Each directory corresponds to one metastore table

Using the various Hive table types

Internal tables

CREATE TABLE workers( id INT, name STRING)    
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';  

This statement creates an internal table named workers with a comma as the delimiter; \054 is the octal ASCII code for the comma.
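
A quick shell check makes the '\054' escape less mysterious: printf renders the same octal escape as a comma. The sample line below is made up for illustration, and the splitting only mimics the idea of a field delimiter, not Hive's actual parser.

```shell
# '\054' is the octal escape for ASCII 44, the comma.
delim=$(printf '\054')
echo "delimiter: $delim"

# Split a hypothetical data line on that delimiter, the way fields are separated.
line='1,jack'
id=${line%%"$delim"*}     # text before the first delimiter
name=${line#*"$delim"}    # text after the first delimiter
echo "id=$id name=$name"
```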

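We can use show tables; to see which tables exist. Many Hive statements are modeled on MySQL, so when you don't know a statement, the MySQL version will mostly work. The exception is limit, which is a bit odd; more on that later.
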
hive> show tables;  
OK  
h_employee  
h_employee2  
h_employee_export  
h_http_access_logs  
hive_employee  
workers  
Time taken: 0.371 seconds, Fetched: 6 row(s)  

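With the table created, let's try inserting a few rows. Note that Hive (at least the version used here) does not support single-row insert statements; data has to be loaded in batches, so don't expect a statement like insert into workers values (1,'jack') to work. Hive supports two ways of inserting data:

Loading data from a file
Reading data from another table and inserting it (insert from select)

Here I load data from a file. First create a file named workers.csv
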
$ cat workers.csv  
1,jack  
2,terry  
3,michael  

Use LOAD DATA to import it into the Hive table

hive> LOAD DATA LOCAL INPATH '/home/alex/workers.csv' INTO TABLE workers;  
Copying data from file:/home/alex/workers.csv  
Copying file: file:/home/alex/workers.csv  
Loading data to table default.workers  
Table default.workers stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 25, raw_data_size: 0]  
OK  
Time taken: 0.655 seconds  

Notes

Don't leave out LOCAL. LOAD DATA LOCAL INPATH looks for the source file on your local disk, whereas LOAD DATA INPATH looks for it on HDFS.
Adding OVERWRITE clears the table before importing, for example LOAD DATA LOCAL INPATH '/home/alex/workers.csv' OVERWRITE INTO TABLE workers;

Query the data

hive> select * from workers;  
OK  
1   jack  
2   terry  
3   michael  
Time taken: 0.177 seconds, Fetched: 3 row(s)  

Let's see how the data is stored inside the Hive internal table after the import

# hdfs dfs -ls /user/hive/warehouse/workers/  
Found 1 items  
-rwxrwxrwt   2 root supergroup         25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv  

So the file is simply copied in unchanged. It really is that crude!

We can experiment by adding another file, workers2.txt (I deliberately changed the extension; Hive does not look at the extension)

# cat workers2.txt   
4,peter  
5,kate  
6,ted  

Import it

hive> LOAD DATA LOCAL INPATH '/home/alex/workers2.txt' INTO TABLE workers;  
Copying data from file:/home/alex/workers2.txt  
Copying file: file:/home/alex/workers2.txt  
Loading data to table default.workers  
Table default.workers stats: [num_partitions: 0, num_files: 2, num_rows: 0, total_size: 46, raw_data_size: 0]  
OK  
Time taken: 0.79 seconds  

Look at the file storage structure

# hdfs dfs -ls /user/hive/warehouse/workers/  
Found 2 items  
-rwxrwxrwt   2 root supergroup         25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv  
-rwxrwxrwt   2 root supergroup         21 2014-12-08 15:29 /user/hive/warehouse/workers/workers2.txt  

There is now an additional workers2.txt

Query again with SQL

hive> select * from workers;  
OK  
1   jack  
2   terry  
3   michael  
4   peter  
5   kate  
6   ted  
Time taken: 0.144 seconds, Fetched: 6 row(s)  
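
The idea that an internal table is just a directory of delimited files can be mimicked locally. The sketch below uses a temporary directory as a stand-in for /user/hive/warehouse/workers and a hypothetical select_all function; it illustrates the storage model only and is not how Hive reads data internally.

```shell
# Stand-in for the table directory /user/hive/warehouse/workers (a local temp dir here).
table_dir=$(mktemp -d)
printf '1,jack\n2,terry\n3,michael\n' > "$table_dir/workers.csv"
printf '4,peter\n5,kate\n6,ted\n'     > "$table_dir/workers2.txt"

# Rough equivalent of "select * from workers": read every file in the directory,
# whatever its extension, and turn the comma delimiter into a tab.
select_all() {
  cat "$1"/* | tr ',' '\t'
}

select_all "$table_dir"

# Count the rows; the arithmetic expansion strips any padding wc may add.
row_count=$(( $(select_all "$table_dir" | wc -l) ))
echo "rows: $row_count"
```

Dropping a third file into the directory would make its rows show up in the output as well, which is the same effect the second LOAD DATA had above.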

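Partitioned tables

Partitioned tables are used to speed up queries. Suppose you have a very large amount of data, but your use case is producing daily reports from it. You can then partition by day, so that producing the report for 2014-05-05 only requires loading that day's data. Let's create a partitioned table and take a look
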
create table partition_employee(id int, name string)   
partitioned by(daytime string)   
row format delimited fields TERMINATED BY '\054';  

Note that the partition attribute is not one of the regular columns.

First create two test data files, one for each of two days

# cat 2014-05-05  
22,kitty  
33,lily  
# cat 2014-05-06  
14,sami  
45,micky  

Import them into the partitioned table

hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-05' INTO TABLE partition_employee partition(daytime='2014-05-05');  
Copying data from file:/home/alex/2014-05-05  
Copying file: file:/home/alex/2014-05-05  
Loading data to table default.partition_employee partition (daytime=2014-05-05)  
Partition default.partition_employee{daytime=2014-05-05} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]  
Table default.partition_employee stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]  
OK  
Time taken: 1.154 seconds  
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-06' INTO TABLE partition_employee partition(daytime='2014-05-06');  
Copying data from file:/home/alex/2014-05-06  
Copying file: file:/home/alex/2014-05-06  
Loading data to table default.partition_employee partition (daytime=2014-05-06)  
Partition default.partition_employee{daytime=2014-05-06} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]  
Table default.partition_employee stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 42, raw_data_size: 0]  
OK  
Time taken: 0.763 seconds  

When importing, use partition to specify the partition.

When querying, specify the partition to query

hive> select * from partition_employee where daytime='2014-05-05';  
OK  
22  kitty   2014-05-05  
33  lily    2014-05-05  
Time taken: 0.173 seconds, Fetched: 2 row(s)  

My query uses no special syntax; Hive automatically detects whether the where clause contains a partition column. Operators such as greater-than and less-than also work

hive> select * from partition_employee where daytime>='2014-05-05';  
OK  
22  kitty   2014-05-05  
33  lily    2014-05-05  
14  sami    2014-05-06  
45  micky   2014-05-06  
Time taken: 0.273 seconds, Fetched: 4 row(s)  

Let's look at the storage structure

# hdfs dfs -ls /user/hive/warehouse/partition_employee  
Found 2 items  
drwxrwxrwt   - root supergroup          0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-05  
drwxrwxrwt   - root supergroup          0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-06  

Let's try a two-dimensional partitioned table

create table p_student(id int, name string)   
partitioned by(daytime string,country string)   
row format delimited fields TERMINATED BY '\054';  

Prepare some data to insert

# cat 2014-09-09-CN   
1,tammy  
2,eric  
# cat 2014-09-10-CN   
3,paul  
4,jolly  
# cat 2014-09-10-EN   
44,ivan  
66,billy  

Import into Hive

hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-09-CN' INTO TABLE p_student partition(daytime='2014-09-09',country='CN');  
Copying data from file:/home/alex/2014-09-09-CN  
Copying file: file:/home/alex/2014-09-09-CN  
Loading data to table default.p_student partition (daytime=2014-09-09, country=CN)  
Partition default.p_student{daytime=2014-09-09, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]  
Table default.p_student stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]  
OK  
Time taken: 0.736 seconds  
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-CN' INTO TABLE p_student partition(daytime='2014-09-10',country='CN');  
Copying data from file:/home/alex/2014-09-10-CN  
Copying file: file:/home/alex/2014-09-10-CN  
Loading data to table default.p_student partition (daytime=2014-09-10, country=CN)  
Partition default.p_student{daytime=2014-09-10, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]  
Table default.p_student stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 38, raw_data_size: 0]  
OK  
Time taken: 0.691 seconds  
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-EN' INTO TABLE p_student partition(daytime='2014-09-10',country='EN');  
Copying data from file:/home/alex/2014-09-10-EN  
Copying file: file:/home/alex/2014-09-10-EN  
Loading data to table default.p_student partition (daytime=2014-09-10, country=EN)  
Partition default.p_student{daytime=2014-09-10, country=EN} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]  
Table default.p_student stats: [num_partitions: 3, num_files: 3, num_rows: 0, total_size: 59, raw_data_size: 0]  
OK  
Time taken: 0.622 seconds  

Look at the storage structure

# hdfs dfs -ls /user/hive/warehouse/p_student  
Found 2 items  
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09  
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-10  
# hdfs dfs -ls /user/hive/warehouse/p_student/daytime=2014-09-09  
Found 1 items  
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09/country=CN  
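
The listing shows that each partition column becomes one directory level named key=value, in the order given in PARTITIONED BY. A small sketch can build that path; partition_path is a hypothetical helper, and it assumes the default database and the /user/hive/warehouse location used in this article.

```shell
# Hypothetical helper: build the HDFS directory of a partition.
# $1 = table name, remaining arguments = key=value pairs in PARTITIONED BY order.
# Assumes the default database under /user/hive/warehouse.
partition_path() {
  path="/user/hive/warehouse/$1"
  shift
  for kv in "$@"; do
    path="$path/$kv"
  done
  echo "$path"
}

partition_path p_student daytime=2014-09-10 country=EN
```

The result can be passed straight to hdfs, for example hdfs dfs -ls "$(partition_path p_student daytime=2014-09-10 country=EN)".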

Query the data

hive> select * from p_student;  
OK  
1   tammy   2014-09-09  CN  
2   eric    2014-09-09  CN  
3   paul    2014-09-10  CN  
4   jolly   2014-09-10  CN  
44  ivan    2014-09-10  EN  
66  billy   2014-09-10  EN  
Time taken: 0.228 seconds, Fetched: 6 row(s)  

hive> select * from p_student where daytime='2014-09-10' and country='EN';  
OK  
44  ivan    2014-09-10  EN  
66  billy   2014-09-10  EN  
Time taken: 0.224 seconds, Fetched: 2 row(s)  

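Bucketed tables

A bucketed table distributes rows into different "buckets" according to the hash of a column. The vivid name comes from the habit of sorting things into a row of buckets, each with a different label. Bucketed tables are mainly used for sampling.

The following example is taken from the official tutorial. Partitioning and bucketing can be used together, so this example uses both features
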
CREATE TABLE b_student(id INT, name STRING)  
PARTITIONED BY(dt STRING, country STRING)  
CLUSTERED BY(id) SORTED BY(name) INTO 4 BUCKETS  
row format delimited   
    fields TERMINATED BY '\054';  

This means the hash is computed on id, and rows are stored sorted by name.

I won't repeat the data preparation and import steps; this is the data after the import

hive> select * from b_student;  
OK  
1   tammy   2014-09-09  CN  
2   eric    2014-09-09  CN  
3   paul    2014-09-10  CN  
4   jolly   2014-09-10  CN  
34  allen   2014-09-11  EN  
Time taken: 0.727 seconds, Fetched: 5 row(s)  

Sample one bucket's worth of data out of the four buckets

hive> select * from b_student tablesample(bucket 1 out of 4 on id);  
Total MapReduce jobs = 1  
Launching Job 1 out of 1  
Number of reduce tasks is set to 0 since there's no reduce operator  
Starting Job = job_1406097234796_0041, Tracking URL = http://hadoop01:8088/proxy/application_1406097234796_0041/  
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1406097234796_0041  
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0  
2014-12-08 17:35:56,995 Stage-1 map = 0%,  reduce = 0%  
2014-12-08 17:36:06,783 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.9 sec  
2014-12-08 17:36:07,845 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.9 sec  
MapReduce Total cumulative CPU time: 2 seconds 900 msec  
Ended Job = job_1406097234796_0041  
MapReduce Jobs Launched:   
Job 0: Map: 1   Cumulative CPU: 2.9 sec   HDFS Read: 482 HDFS Write: 22 SUCCESS  
Total MapReduce CPU Time Spent: 2 seconds 900 msec  
OK  
4   jolly   2014-09-10  CN  
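
Why only id 4 comes back can be reasoned out with a little arithmetic. The sketch below rests on an assumption about Hive's behaviour rather than on anything stated in this article: that the hash of an INT column is the integer itself, and that bucket x out of y selects rows where the hash modulo y equals x - 1. bucket_for is a hypothetical helper.

```shell
# Assumption: for an INT column the hash is the value itself, and the
# 1-based bucket number is (hash & Integer.MAX_VALUE) % num_buckets + 1.
# $1 = id value, $2 = number of buckets.
bucket_for() {
  echo $(( ($1 & 2147483647) % $2 + 1 ))
}

# The five ids from the b_student table above, with 4 buckets.
for id in 1 2 3 4 34; do
  echo "id=$id -> bucket $(bucket_for "$id" 4)"
done
```

Under that assumption only id 4 falls into bucket 1, which agrees with the single row (jolly) returned by the sample query.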

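External tables

An external table is one whose storage is not managed by Hive. For example, it can rely on HBase for storage, with Hive providing only a mapping. I'll use HBase as the example.

First create an HBase table named employee
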
hbase(main):005:0> create 'employee','info'    
0 row(s) in 0.4740 seconds    
    
=> Hbase::Table - employee    
hbase(main):006:0> put 'employee',1,'info:id',1    
0 row(s) in 0.2080 seconds    
    
hbase(main):008:0> scan 'employee'    
ROW                                      COLUMN+CELL                                                                                                               
 1                                       column=info:id, timestamp=1417591291730, value=1                                                                          
1 row(s) in 0.0610 seconds    
    
hbase(main):009:0> put 'employee',1,'info:name','peter'    
0 row(s) in 0.0220 seconds    
    
hbase(main):010:0> scan 'employee'    
ROW                                      COLUMN+CELL                                                                                                               
 1                                       column=info:id, timestamp=1417591291730, value=1                                                                          
 1                                       column=info:name, timestamp=1417591321072, value=peter                                                                    
1 row(s) in 0.0450 seconds    
    
hbase(main):011:0> put 'employee',2,'info:id',2    
0 row(s) in 0.0370 seconds    
    
hbase(main):012:0> put 'employee',2,'info:name','paul'    
0 row(s) in 0.0180 seconds    
    
hbase(main):013:0> scan 'employee'    
ROW                                      COLUMN+CELL                                                                                                               
 1                                       column=info:id, timestamp=1417591291730, value=1                                                                          
 1                                       column=info:name, timestamp=1417591321072, value=peter                                                                    
 2                                       column=info:id, timestamp=1417591500179, value=2                                                                          
 2                                       column=info:name, timestamp=1417591512075, value=paul                                                                     
2 row(s) in 0.0440 seconds   

Create an external table to map it

hive> CREATE EXTERNAL TABLE h_employee(key int, id int, name string)     
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'    
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, info:id,info:name")    
    > TBLPROPERTIES ("hbase.table.name" = "employee");    
OK    
Time taken: 0.324 seconds    
hive> select * from h_employee;    
OK    
1   1   peter    
2   2   paul    
Time taken: 1.129 seconds, Fetched: 2 row(s)  

Query syntax

For the full syntax, see the official manual: https://cwiki.apache.org/confluence/display/Hive/Tutorial

I'll only mention a few oddities

Limiting the number of rows

To show x rows, you still use limit, for example

hive> select * from h_employee limit 1  
    > ;  
OK  
1   1   peter  
Time taken: 0.284 seconds, Fetched: 1 row(s)  

However, a starting point such as offset is not supported
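
One workaround is to do the skipping on the client side, after Hive has printed its rows. The sketch below is only an illustration: page_rows is a hypothetical helper, the demo rows are made up, and the query still returns every row to the client before any are dropped.

```shell
# Hypothetical helper: skip $1 rows, then print at most $2 rows from stdin.
page_rows() {
  tail -n +"$(( $1 + 1 ))" | head -n "$2"
}

# Demo on made-up rows. With Hive, the input could come from something like:
#   hive -e 'select * from h_employee' | page_rows 1 1
printf '1\tpeter\n2\tpaul\n3\tmary\n' | page_rows 1 1
```

For stable pages the query itself should use order by, since the row order is otherwise not guaranteed.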

References

http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_hiveserver2_configure.html