Translator's note: this is my own translation; the original document is at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Hive Data Definition Language
Overview
This is the HiveQL DDL syntax documentation, covering:
- CREATE DATABASE/SCHEMA, TABLE, VIEW, FUNCTION, INDEX
- DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
- TRUNCATE TABLE
- ALTER DATABASE/SCHEMA, TABLE, VIEW
- MSCK REPAIR TABLE (or ALTER TABLE RECOVER PARTITIONS)
- SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, VIEWS, PARTITIONS, FUNCTIONS, INDEX[ES], COLUMNS, CREATE TABLE
- DESCRIBE DATABASE/SCHEMA, table_name, view_name
PARTITION statements are usually options of TABLE statements, except for SHOW PARTITIONS.
Keywords, Non-reserved Keywords and Reserved Keywords
All Keywords

Version | Non-reserved Keywords | Reserved Keywords |
---|---|---|
Hive 1.2.0 | | |
Hive 2.0.0 | removed: added: | added: |
Hive 2.1.0 | added: | added: |
Hive 2.2.0 | added: | added: |
Hive 3.0.0 | added: TIMESTAMPTZ, ZONE | added: TIME, NUMERIC |
Version information
REGEXP and RLIKE are non-reserved keywords prior to Hive 2.0.0 and reserved keywords starting in Hive 2.0.0 (HIVE-11703).
Reserved keywords are permitted as identifiers if you quote them as described in Supporting Quoted Identifiers in Column Names (version 0.13.0 and later, see HIVE-6013). Most of the keywords are reserved through HIVE-6617 in order to reduce ambiguity in the grammar (version 1.2.0 and later). There are two ways for a user who still wants to use these reserved keywords as identifiers: (1) use quoted identifiers; (2) set hive.support.sql11.reserved.keywords=false (version 2.1.0 and earlier).
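The two approaches can be sketched as follows (the table name t1 and its columns are hypothetical):

```sql
-- (1) Quote the reserved keyword with backticks (Hive 0.13.0 and later):
CREATE TABLE t1 (`date` STRING, amount DOUBLE);
SELECT `date`, amount FROM t1;

-- (2) Or, in Hive 2.1.0 and earlier, disable keyword reservation:
SET hive.support.sql11.reserved.keywords=false;
```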
Create/Drop/Alter/Use Database
Create Database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
The uses of SCHEMA and DATABASE are interchangeable – they mean the same thing. CREATE DATABASE was added in Hive 0.6 (HIVE-675). The WITH DBPROPERTIES clause was added in Hive 0.7 (HIVE-1836).
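A minimal example under the syntax above (the database name, comment, path, and property are hypothetical):

```sql
CREATE DATABASE IF NOT EXISTS sales_db
COMMENT 'Raw sales data'
LOCATION '/user/hive/warehouse/sales_db.db'
WITH DBPROPERTIES ('creator'='etl_team');
```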
Drop Database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
The uses of SCHEMA and DATABASE are interchangeable – they mean the same thing. DROP DATABASE was added in Hive 0.6 (HIVE-675). The default behavior is RESTRICT, where DROP DATABASE will fail if the database is not empty. To drop the tables in the database as well, use DROP DATABASE ... CASCADE. Support for RESTRICT and CASCADE was added in Hive 0.8 (HIVE-2090).
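For instance (the database name is hypothetical):

```sql
-- Fails if sales_db still contains tables (RESTRICT is the default):
DROP DATABASE IF EXISTS sales_db;

-- Drops the database together with all of its tables:
DROP DATABASE IF EXISTS sales_db CASCADE;
```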
Alter Database
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;   -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;   -- (Note: Hive 2.2.1, 2.4.0 and later)
The uses of SCHEMA and DATABASE are interchangeable – they mean the same thing. ALTER SCHEMA was added in Hive 0.14 (HIVE-6601).
The ALTER DATABASE ... SET LOCATION statement does not move the contents of the database's current directory to the newly specified location. It does not change the locations associated with any tables or partitions under the specified database. It only changes the default parent directory where new tables will be added for this database. This behavior is analogous to how changing a table's location does not move existing partitions to a new location.
No other metadata about a database can be changed.
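A sketch of the first two forms (the names and values are hypothetical):

```sql
ALTER DATABASE sales_db SET DBPROPERTIES ('edited-by'='etl_team');
ALTER DATABASE sales_db SET OWNER USER hive_admin;   -- Hive 0.13.0 and later
```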
Use Database
USE database_name;
USE DEFAULT;
USE sets the current database for all subsequent HiveQL statements. To revert to the default database, use the keyword "default" instead of a database name. To check which database is currently being used: SELECT current_database() (as of Hive 0.13.0).
USE database_name was added in Hive 0.6 (HIVE-675).
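For example (sales_db is a hypothetical database):

```sql
USE sales_db;                -- subsequent statements run against sales_db
SELECT current_database();   -- verify, as of Hive 0.13.0
USE default;                 -- switch back to the default database
```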
Create/Drop/Truncate Table
Create Table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)   -- (Note: Available in Hive 0.10.0 and later)
   ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
   [STORED AS DIRECTORIES] ]
[
 [ROW FORMAT row_format]
 [STORED AS file_format]
 | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]   -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type   -- (Note: Available in Hive 0.7.0 and later)

primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION   -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)   -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)

array_type
  : ARRAY < data_type >

map_type
  : MAP < primitive_type, data_type >

struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>

union_type
  : UNIONTYPE < data_type, data_type, ... >   -- (Note: Available in Hive 0.7.0 and later)

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
      [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
      [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

file_format:
  : SEQUENCEFILE
  | TEXTFILE     -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE       -- (Note: Available in Hive 0.6.0 and later)
  | ORC          -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET      -- (Note: Available in Hive 0.13.0 and later)
  | AVRO         -- (Note: Available in Hive 0.14.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE]
CREATE TABLE creates a table with the given name. An error is thrown if a table or view with the same name already exists. You can use IF NOT EXISTS to skip the error.
- Table names and column names are case insensitive, but SerDe and property names are case sensitive.
- In Hive 0.12 and earlier, only alphanumeric and underscore characters are allowed in table and column names.
- In Hive 0.13 and later, column names can contain any Unicode character (see HIVE-6013); however, dot (.) and colon (:) yield errors on querying, so they are disallowed in Hive 1.2.0. Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. Backtick quotation also enables the use of reserved keywords for table and column identifiers.
- To revert to pre-0.13.0 behavior and restrict column names to alphanumeric and underscore characters, set the configuration property hive.support.quoted.identifiers to none. In this configuration, backticked names are interpreted as regular expressions. For details, see Supporting Quoted Identifiers in Column Names.
- Table and column comments are string literals (single-quoted).
- A table created without the EXTERNAL clause is called a managed table because Hive manages its data. To find out if a table is managed or external, look for tableType in the output of DESCRIBE EXTENDED table_name.
- The TBLPROPERTIES clause lets you tag the table definition with your own metadata key/value pairs. Some predefined table properties also exist, such as last_modified_user and last_modified_time, which are automatically added and managed by Hive. Other predefined table properties include:
- TBLPROPERTIES ("comment"="table_comment")
- TBLPROPERTIES ("hbase.table.name"="table_name") – see HBase Integration.
- TBLPROPERTIES ("immutable"="true") or ("immutable"="false") in release 0.13.0+ (HIVE-6406) – see Inserting Data into Hive Tables from Queries.
- TBLPROPERTIES ("orc.compress"="ZLIB") or ("orc.compress"="SNAPPY") or ("orc.compress"="NONE") and other ORC properties – see ORC Files.
- TBLPROPERTIES ("transactional"="true") or ("transactional"="false") in release 0.14.0+, the default is "false" – see Hive Transactions.
- TBLPROPERTIES ("NO_AUTO_COMPACTION"="true") or ("NO_AUTO_COMPACTION"="false"), the default is "false" – see Hive Transactions.
- TBLPROPERTIES ("compactor.mapreduce.map.memory.mb"="mapper_memory") – see Hive Transactions.
- TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.num.threshold"="threshold_num") – see Hive Transactions.
- TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.pct.threshold"="threshold_pct") – see Hive Transactions.
- TBLPROPERTIES ("auto.purge"="true") or ("auto.purge"="false") in release 1.2.0+ (HIVE-9118) – see Drop Table, Drop Partitions, Truncate Table, and Insert Overwrite.
- TBLPROPERTIES ("EXTERNAL"="TRUE") in release 0.6.0+ (HIVE-1329) – Change a managed table to an external table and vice versa for "FALSE".
- As of Hive 2.4.0 (HIVE-16324) the value of the property 'EXTERNAL' is parsed as a boolean (case insensitive true or false) instead of a case sensitive string comparison.
- To specify a database for the table, either issue the USE database_name statement prior to the CREATE TABLE statement (in Hive 0.6 and later) or qualify the table name with a database name ("database_name.table.name" in Hive 0.7 and later). The keyword "default" can be used for the default database.
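The two ways of targeting a database can be sketched as follows (mydb and the table names are hypothetical):

```sql
-- Option 1: set the current database first (Hive 0.6 and later):
USE mydb;
CREATE TABLE t1 (id INT);

-- Option 2: qualify the table name directly (Hive 0.7 and later):
CREATE TABLE mydb.t2 (id INT);
```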
See Alter Table below for more information about table comments, table properties, and SerDe properties.
See Type System and Hive Data Types for details about the primitive and complex data types.
Managed and External Tables
By default Hive creates managed tables, where files, metadata and statistics are managed by internal Hive processes. A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /apps/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the LOCATION property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.
Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.
An external table describes the metadata/schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.
Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.
A managed or external table can be identified using the DESCRIBE FORMATTED table_name command, which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on the table type.
Statistics can be managed on internal and external tables and partitions for query optimization.
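As an illustration (the table name and HDFS path are hypothetical):

```sql
-- External table over files that already exist outside Hive's control:
CREATE EXTERNAL TABLE web_logs (ip STRING, url STRING)
LOCATION '/data/raw/web_logs';

-- The Table Type field of the output shows EXTERNAL_TABLE:
DESCRIBE FORMATTED web_logs;
```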
Storage Formats
Hive supports built-in and custom-developed file formats. See CompressedStorage for details on compressed table storage.
The following are some of the formats built in to Hive:
Storage Format | Description |
---|---|
STORED AS TEXTFILE | Stored as plain text files. TEXTFILE is the default file format, unless the configuration parameter hive.default.fileformat has a different setting. Use the DELIMITED clause to read delimited files. Enable escaping for the delimiter characters by using the 'ESCAPED BY' clause (such as ESCAPED BY '\') |
STORED AS SEQUENCEFILE | Stored as compressed Sequence File. |
STORED AS ORC | Stored as ORC file format. Supports ACID Transactions & Cost-based Optimizer (CBO). Stores column-level metadata. |
STORED AS PARQUET | Stored as Parquet format for the Parquet columnar storage format in Hive 0.13.0 and later; Use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT syntax ... in Hive 0.10, 0.11, or 0.12. |
STORED AS AVRO | Stored as Avro format in Hive 0.14.0 and later (see Avro SerDe). |
STORED AS RCFILE | Stored as Record Columnar File format. |
STORED BY | Stored by a non-native table format. To create or link to a non-native table, for example a table backed by HBase or Druid or Accumulo. See StorageHandlers for more information on this option. |
INPUTFORMAT and OUTPUTFORMAT | in the file_format to specify the name of a corresponding InputFormat and OutputFormat class as a string literal. For example, 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For LZO compression, the values to use are 'INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"' (see LZO Compression). |
Row Formats & SerDe
You can create tables with a custom SerDe or using a native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified.
Use the SERDE clause to create a table with a custom SerDe. For more information on SerDes:
You must specify a list of columns for tables that use a native SerDe. Refer to the Types part of the User Guide for the allowable column types.
A list of columns for tables that use a custom SerDe may be specified, but Hive will query the SerDe to determine the actual list of columns for this table.
For general information about SerDes, see Hive SerDe in the Developer Guide. Also see SerDe for details about input and output processing.
To change a table's SerDe or SERDEPROPERTIES, use the ALTER TABLE statement described below under Add SerDe Properties.
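For example, SerDe properties on an existing table can be adjusted like this (the table name and the property key are hypothetical; which keys are honored depends on the table's SerDe):

```sql
ALTER TABLE web_logs SET SERDEPROPERTIES ('field.delim' = '\t');
```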
Row Format | Description |
---|---|
RegEx |
Stored as plain text file, translated by Regular Expression. The following example defines a table in the default Apache Weblog format.

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
|
JSON
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE |
Stored as plain text file in JSON format.
In some distributions, a reference to hive-hcatalog-core.jar is required.

ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar;

CREATE TABLE my_table(a string, b bigint, ...)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

The JsonSerDe was moved to Hive from HCatalog; before that it was in the hive-contrib project. It was added to the Hive distribution by HIVE-4895. An Amazon SerDe is available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar for releases prior to 0.12.0.
The JsonSerDe for JSON files is available in Hive 0.12 and later. |
CSV/TSV
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE |
Stored as plain text file in CSV / TSV format.
The CSVSerde is available in Hive 0.14 and greater.
The following example creates a TSV (Tab-separated) file.

CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "\t",
  "quoteChar"     = "'",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE;

Default properties for the SerDe are for a Comma-Separated (CSV) file:
DEFAULT_ESCAPE_CHARACTER \
DEFAULT_QUOTE_CHARACTER  "
DEFAULT_SEPARATOR        ,

This SerDe works for most CSV data, but does not handle embedded newlines. To use the SerDe, specify the fully qualified class name org.apache.hadoop.hive.serde2.OpenCSVSerde. Documentation is based on original documentation at https://github.com/ogrodnek/csv-serde.
This SerDe treats all columns to be of type String. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output would show string column type. The type information is retrieved from the SerDe. To convert columns to the desired type in a table, you can create a view over the table that does the CAST to the desired type.
The CSV SerDe is based on https://github.com/ogrodnek/csv-serde, and was added to the Hive distribution in HIVE-7777. The CSVSerde has been built and tested against Hive 0.14 and later, and uses Open-CSV 2.3, which is bundled with the Hive distribution.
For general information about SerDes, see Hive SerDe in the Developer Guide. Also see SerDe for details about input and output processing. |
Partitioned Tables
Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns, and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within each bucket via SORTED BY columns. This can improve performance on certain kinds of queries.
If, when creating a partitioned table, you get this error: "FAILED: Error in semantic analysis: Column repeated in partitioning columns," it means you are trying to include the partition column in the data of the table itself. You probably really do have the column defined. However, the partition you create makes a pseudocolumn on which you can query, so you must rename your table column to something else (that users should not query on!).
For example, suppose your original unpartitioned table had three columns: id, date, and name.
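A sketch of the fix (the table name and the new column name are hypothetical): keep id and name as data columns, and expose the date only through a renamed partition column.

```sql
-- Renaming the data column "date" to "dt" avoids
-- "FAILED: Error in semantic analysis: Column repeated in partitioning columns":
CREATE TABLE employees (id INT, name STRING)
PARTITIONED BY (dt STRING);
```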