Big Data: Hive Basic Operations

DDL


create

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];
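
To make the grammar above concrete, here is a minimal sketch of an external, partitioned table stored as delimited text. The database, table, column names, and HDFS path are hypothetical placeholders, not part of the reference syntax.

```sql
-- Hypothetical example: external, partitioned, tab-delimited text table.
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.page_views (
  view_time  TIMESTAMP COMMENT 'time the page was viewed',
  user_id    BIGINT    COMMENT 'id of the viewing user',
  page_url   STRING    COMMENT 'url of the page'
)
COMMENT 'raw page view events'
PARTITIONED BY (dt STRING COMMENT 'partition day, yyyy-MM-dd')
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/warehouse/page_views';
```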

Column data types (data_type)

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)

primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)

array_type
  : ARRAY < data_type >

map_type
  : MAP < primitive_type, data_type >

struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>

union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
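
As an illustration of the complex types, a table mixing primitive, ARRAY, MAP, and STRUCT columns might be declared as below; the table and column names are made up for the example.

```sql
-- Hypothetical example: primitive and complex column types in one table.
CREATE TABLE IF NOT EXISTS mydb.employees (
  name         STRING,
  salary       DECIMAL(10, 2),
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, zip:INT>
);
```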

Field delimiters: ROW FORMAT row_format

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
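
The DELIMITED clause only applies to text-based storage. A hypothetical table that spells out Hive's default Ctrl-A / Ctrl-B / Ctrl-C separators explicitly might look like this (table and column names are placeholders):

```sql
-- Hypothetical example: explicit delimiters for a text table with complex types.
CREATE TABLE IF NOT EXISTS mydb.employees_txt (
  name         STRING,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'           -- column separator (Ctrl-A, Hive's default)
  COLLECTION ITEMS TERMINATED BY '\002' -- separator between array/map elements
  MAP KEYS TERMINATED BY '\003'         -- separator between a map key and its value
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```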

Word count example

Hadoop is a distributed computing framework, and Hive is a data warehouse tool built on top of it. Hive provides a SQL-like query interface, so large data sets can be processed without writing MapReduce code by hand. A common use is analyzing text data, for example counting how often each word appears in log files. The basic steps are:

1. **Create the table**: Define a table whose single column holds one line of text per row. If the data already lives in HDFS and should stay where it is, create an external table; Hive then only records the metadata and does not delete the underlying files when the table is dropped.

```sql
CREATE EXTERNAL TABLE your_table (lines STRING);
```

2. **Load the data**: Import the text file into the table, typically with `LOAD DATA INPATH` from HDFS, or `LOAD DATA LOCAL INPATH` from the local file system.

```sql
LOAD DATA LOCAL INPATH '/path/to/your/textfile.txt' INTO TABLE your_table;
```

3. **Tokenize and count**: Split each line into words with `split` (here using a space as the delimiter; `regexp_extract` can be used for more complex parsing), flatten the resulting array with `LATERAL VIEW explode`, and group by word to count occurrences.

```sql
SELECT word, COUNT(*) AS frequency
FROM your_table
LATERAL VIEW explode(split(lines, ' ')) t AS word
GROUP BY word;
```

4. **Sort by frequency**: `GROUP BY` already produces one row per distinct word, so no separate deduplication pass is needed; add `ORDER BY` to rank the words by how often they occur.

```sql
SELECT word, COUNT(*) AS frequency
FROM your_table
LATERAL VIEW explode(split(lines, ' ')) t AS word
GROUP BY word
ORDER BY frequency DESC;
```

5. **Read the results**: Running the query returns a table of words and their occurrence counts, sorted in descending order of frequency. The counting query can also be wrapped as a subquery when further filtering or processing is needed.

```sql
SELECT *
FROM (
  SELECT word, COUNT(*) AS frequency
  FROM your_table
  LATERAL VIEW explode(split(lines, ' ')) t AS word
  GROUP BY word
) ordered_results
ORDER BY frequency DESC;
```