Getting Started with Hive

Table of Contents

Overview

Architecture

1:User interfaces (client)

2:Metadata (Metastore)

3:Hadoop

4:Driver

Advantages and use cases of Hive

Common Hive commands

1:show databases;

2:create database rachel_db_hive;

3:drop database db_hive;

4:use rachel_db_hive;

5:Creating a table

6:show tables

7:select

8:Loading data

9:How select differs across Hive versions

10:Column aliases

11:desc u_data;

12:desc extended u_data;

13:desc formatted u_data;

14:show functions

15:desc function upper;

16:select userid,upper(username) upper_name from student;


Overview

Hive maps structured data stored on HDFS onto tables.

The data is then processed by querying those tables.
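
As a quick illustration of this mapping, a minimal sketch (the table name and HDFS path below are hypothetical; any tab-separated files already sitting in HDFS work the same way):

CREATE EXTERNAL TABLE demo_logs (
  ip  STRING,
  url STRING,
  ts  STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/demo_logs';

-- the files are not moved or modified; Hive simply reads them through the table
SELECT count(1) FROM demo_logs;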

架构

1:User interfaces (client)

  • CLI (the hive shell)
  • JDBC (access Hive from Java programs)
  • WEBUI (access Hive from a browser)

2:Metadata (Metastore)

  • Table name
  • Database the table belongs to
  • Table owner
  • Column names
  • Partition column names
  • Table type (external table or managed/internal table)
  • HDFS directory where the table's data is stored
    [rachel@bigdata-senior02 ~]$ mysql -u root -p
    Enter password: 
    mysql> select * from SDS
    	-> ;
    |SD_ID               |CD_ID               |INPUT_FORMAT                                                                                        |IS_COMPRESSED|IS_STOREDASSUBDIRECTORIES|LOCATION                                                                                            |NUM_BUCKETS|OUTPUT_FORMAT                                                                                       |SERDE_ID            |
    |--------------------|--------------------|----------------------------------------------------------------------------------------------------|-------------|-------------------------|----------------------------------------------------------------------------------------------------|-----------|----------------------------------------------------------------------------------------------------|--------------------|
    |1                   |1                   |org.apache.hadoop.mapred.TextInputFormat                                                            |0            |0                        |hdfs://ns/user/hive/warehouse/test                                                                  |-1         |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                                          |1                   |
    |6                   |6                   |org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat                                              |0            |0                        |hdfs://ns/user/hive/warehouse/weblogs                                                               |-1         |org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat                                            |6                   |
    |11                  |11                  |org.apache.hadoop.mapred.TextInputFormat                                                            |0            |0                        |hdfs://ns/user/hive/warehouse/rachel.db/test                                                        |-1         |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                                          |11                  |
    

    SD_ID and CD_ID identify the table's storage descriptor and column descriptor
    INPUT_FORMAT is org.apache.hadoop.mapred.TextInputFormat because the underlying engine is MapReduce
    IS_COMPRESSED records whether the data is compressed
    LOCATION is the HDFS directory that holds the table's data

    The table whose SD_ID is 11 can be seen in Hive as well:

    hive (rachel)> show tables;
    OK
    tab_name
    test
    Time taken: 0.062 seconds, Fetched: 1 row(s)
    

    The TBLS table stores the table owner, table type, and table name:

    SELECT * from TBLS
    
    |TBL_ID              |CREATE_TIME|DB_ID               |LAST_ACCESS_TIME|OWNER                                                                                               |RETENTION  |SD_ID               |TBL_NAME                                                                                            |TBL_TYPE                                                                                            |VIEW_EXPANDED_TEXT                                                                                  |VIEW_ORIGINAL_TEXT                                                                                  |
    |--------------------|-----------|--------------------|----------------|----------------------------------------------------------------------------------------------------|-----------|--------------------|----------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
    |1                   |1537004065 |1                   |0               |rachel                                                                                              |0          |1                   |test                                                                                                |MANAGED_TABLE                                                                                       |[NULL]                                                                                              |[NULL]                                                                                              |
    |6                   |1537028958 |1                   |0               |rachel                                                                                              |0          |6                   |weblogs                                                                                             |EXTERNAL_TABLE                                                                                      |[NULL]                                                                                              |[NULL]                                                                                              |
    |11                  |1537226416 |6                   |0               |rachel                                                                                              |0          |11                  |test                                                                                                |MANAGED_TABLE                                                                                       |[NULL]                                                                                              |[NULL]                                                                                              |
    

    Note: by default the metadata is stored in the embedded Derby database; MySQL is recommended for storing the metadata instead.
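
    To see how these two metastore tables relate, you can join TBLS and SDS on SD_ID. A quick sketch against the MySQL metastore, using only the columns shown in the output above:

    SELECT t.TBL_NAME, t.TBL_TYPE, t.OWNER, s.LOCATION
    FROM TBLS t
    JOIN SDS s ON t.SD_ID = s.SD_ID;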

3:Hadoop

Hive uses HDFS to store the data and MapReduce to compute over it.


4:Driver

    The Driver comprises a parser, an optimizer, a compiler, and an executor.
        1:The Driver receives a HiveQL statement from the client.
        2:The SQL parser validates the statement,
            for example checking that the table and column names exist and that the columns used with GROUP BY are allowed in the select list.
        3:The statement is optimized according to its structure.
        4:The SQL is compiled into an executable MapReduce program.
        5:The MapReduce program is executed.
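
    You can inspect the plan the compiler produces without running the job by prefixing a query with EXPLAIN. A small sketch, using the u_data table created later in this post:

        EXPLAIN SELECT count(1) FROM u_data;
        -- prints the stage plan (a map-reduce stage plus a fetch stage) instead of executing the query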

Advantages and use cases of Hive

1:Simple and easy to use (the interface is SQL-like, which enables rapid development).
2:No need to hand-write MapReduce jobs, which lowers the learning cost for developers.
3:Unified metadata management, so data can be shared with Impala, Spark, and other engines.
4:Easy to extend.
5:Offline data processing, such as log analysis and offline analysis of massive structured data.
6:Hive's execution latency is high, so it is usually used for offline analysis where real-time response is not required.
7:Hive's strength is processing big data; it has no advantage on small data sets, again because of its high latency.

Common Hive commands

DML operates on the data
	DML (load/insert/update/delete/merge, import/export, explain plan)
DDL operates on databases and tables
	DDL (create/drop/alter/truncate/show/describe), Statistics (analyze), Indexes, Archiving
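
A quick side-by-side, reusing the database and table from the examples below:

-- DDL: defines the database and the table
create database rachel_db_hive;
create table u_data (userid INT, movieid INT, rating INT, unixtime STRING);

-- DML: loads and queries the data itself
load data local inpath '/opt/datas/u.data' into table u_data;
select count(1) from u_data;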

Start the metastore service on its own:

[rachel@bigdata-senior02 hive-0.13.1-bin]$ bin/hive --service  metastore
Starting Hive Metastore Server

1:show databases;

hive (rachel)> show databases;
OK
database_name
default
rachel
Time taken: 0.081 seconds, Fetched: 2 row(s)

database_name here is not a table; it is just a column header.
Whether the column header and the current database are displayed is controlled by these settings in hive-site.xml:

<property>
	<name>hive.cli.print.header</name>
	<value>true</value>
</property>

<property>
	<name>hive.cli.print.current.db</name>
	<value>true</value>
</property>
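
The same two options can also be turned on just for the current session from the Hive CLI:

set hive.cli.print.header=true;
set hive.cli.print.current.db=true;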

2:create database rachel_db_hive;

hive (default)> create database rachel_db_hive;
OK
Time taken: 10.458 seconds

3:drop database db_hive;

4:use rachel_db_hive;
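
Commands 2 to 4 above are the basic forms. Two variants worth knowing, shown with the same database names: IF NOT EXISTS avoids an error when the database already exists, and CASCADE lets you drop a database that still contains tables.

create database if not exists rachel_db_hive;
drop database if exists db_hive cascade;
use rachel_db_hive;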

5:Creating a table

CREATE TABLE u_data(
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

6:show tables

hive (rachel_db_hive)> show tables;
OK
tab_name
u_data
Time taken: 0.041 seconds, Fetched: 1 row(s)

7:select

hive (rachel_db_hive)> select * from u_data;
OK
u_data.userid   u_data.movieid  u_data.rating   u_data.unixtime

8:Loading data

hive (rachel_db_hive)> load data local inpath '/opt/datas/u.data' into table u_data;
	Copying data from file:/opt/datas/u.data
	Copying file: file:/opt/datas/u.data
	Loading data to table rachel_db_hive.u_data
	Table rachel_db_hive.u_data stats: [numFiles=1, numRows=0, totalSize=1979173, rawDataSize=0]
	OK
	Time taken: 13.308 seconds
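
With LOCAL, the file is copied from the local filesystem into the table's directory under the warehouse; without LOCAL the path is read from HDFS and the file is moved rather than copied, and OVERWRITE replaces whatever the table already contains. A sketch (the HDFS source path is hypothetical), followed by a quick check that the data landed:

load data inpath '/user/rachel/u.data' overwrite into table u_data;
dfs -ls /user/hive/warehouse/rachel_db_hive.db/u_data;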

9:How select differs across Hive versions

Before Hive 1.0, selecting a specific column launched a MapReduce job;
from Hive 1.0 on, a simple column projection no longer runs MapReduce, because it can be answered by a fetch task.
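
Whether a simple query is fetched directly or compiled to MapReduce is controlled by hive.fetch.task.conversion (newer releases default it to "more"); you can switch the behavior per session:

-- simple SELECT / LIMIT / WHERE-only queries are answered by a fetch task
set hive.fetch.task.conversion=more;
-- force every query through MapReduce
set hive.fetch.task.conversion=none;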

10:Column aliases

An aggregate query with a custom alias, such as the count below, does have to run as a MapReduce job:

hive (rachel_db_hive)> select count(1) u_count from u_data;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1539517625007_0001, Tracking URL = http://bigdata-pro01.rachel.com:8088/proxy/application_1539517625007_0001/
Kill Command = /opt/modules/hadoop-2.5.0/bin/hadoop job  -kill job_1539517625007_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-10-14 08:10:16,687 Stage-1 map = 0%,  reduce = 0%
2018-10-14 08:11:17,397 Stage-1 map = 0%,  reduce = 0%
2018-10-14 08:11:19,549 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.57 sec
2018-10-14 08:11:53,327 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.6 sec
MapReduce Total cumulative CPU time: 6 seconds 600 msec
Ended Job = job_1539517625007_0001
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 6.6 sec   HDFS Read: 1979388 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 600 msec
OK
u_count
100000
Time taken: 165.921 seconds, Fetched: 1 row(s)

11:desc u_data;

col_name        data_type       comment
userid                  int                                         
movieid                 int                                         
rating                  int                                         
unixtime                string                                      
Time taken: 0.763 seconds, Fetched: 4 row(s)

12:desc extended u_data;

hive (rachel_db_hive)> desc extended u_data;
OK
col_name        data_type       comment
userid                  int                                         
movieid                 int                                         
rating                  int                                         
unixtime                string                                      
                 
Detailed Table Information      Table(tableName:u_data, dbName:rachel_db_hive, owner:rachel, createTime:1539516342, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:userid, type:int, comment:null), FieldSchema(name:movieid, type:int, comment:null), FieldSchema(name:rating, type:int, comment:null), FieldSchema(name:unixtime, type:string, comment:null)], location:hdfs://ns/user/hive/warehouse/rachel_db_hive.db/u_data, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=     , field.delim=
Time taken: 0.154 seconds, Fetched: 6 row(s)

13:desc formatted u_data;

Shows detailed information about the table: its schema, table parameters, and storage information.

hive (rachel_db_hive)> desc formatted u_data;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
userid                  int                                         
movieid                 int                                         
rating                  int                                         
unixtime                string                                      
                 
# Detailed Table Information             
Database:               rachel_db_hive           
Owner:                  rachel                   
CreateTime:             Sun Oct 14 07:25:42 EDT 2018     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://ns/user/hive/warehouse/rachel_db_hive.db/u_data   
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   true                
        numFiles                1                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               1979173             
        transient_lastDdlTime   1539518082          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat 
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        serialization.format    \t                  
Time taken: 0.174 seconds, Fetched: 35 row(s)

14:show functions

hive (rachel_db_hive)> show functions;
	OK
	tab_name
	!
	!=
	%
	&
	*
	+
	-
	/
	<
	<=
	<=>
	<>
	=
	==
	>
	>=
	^
	abs
	acos
	and
	array
	array_contains
	ascii
	asin
	assert_true
	atan
	avg
	base64
	between
	bin
	case
	ceil
	ceiling
	coalesce
	collect_list
	collect_set
	compute_stats
	concat
	concat_ws
	context_ngrams
	conv
	corr
	cos
	count
	covar_pop
	covar_samp
	create_union
	cume_dist
	current_database
	date_add
	date_sub
	datediff
	day
	dayofmonth
	decode
	degrees
	dense_rank
	div
	e
	elt
	encode
	ewah_bitmap
	ewah_bitmap_and
	ewah_bitmap_empty
	ewah_bitmap_or
	exp
	explode
	field
	find_in_set
	first_value
	floor
	format_number
	from_unixtime
	from_utc_timestamp
	get_json_object
	hash
	hex
	histogram_numeric
	hour
	if
	in
	in_file
	index
	inline
	instr
	isnotnull
	isnull
	java_method
	json_tuple
	lag
	last_value
	lcase
	lead
	length
	like
	ln
	locate
	log
	log10
	log2
	lower
	lpad
	ltrim
	map
	map_keys
	map_values
	matchpath
	max
	min
	minute
	month
	named_struct
	negative
	ngrams
	noop
	noopwithmap
	not
	ntile
	nvl
	or
	parse_url
	parse_url_tuple
	percent_rank
	percentile
	percentile_approx
	pi
	pmod
	posexplode
	positive
	pow
	power
	printf
	radians
	rand
	rank
	reflect
	reflect2
	regexp
	regexp_extract
	regexp_replace
	repeat
	reverse
	rlike
	round
	row_number
	rpad
	rtrim
	second
	sentences
	sign
	sin
	size
	sort_array
	space
	split
	sqrt
	stack
	std
	stddev
	stddev_pop
	stddev_samp
	str_to_map
	struct
	substr
	substring
	sum
	tan
	to_date
	to_unix_timestamp
	to_utc_timestamp
	translate
	trim
	ucase
	unbase64
	unhex
	unix_timestamp
	upper
	var_pop
	var_samp
	variance
	weekofyear
	when
	windowingtablefunction
	xpath
	xpath_boolean
	xpath_double
	xpath_float
	xpath_int
	xpath_long
	xpath_number
	xpath_short
	xpath_string
	year
	|
	~
Time taken: 0.053 seconds, Fetched: 195 row(s)

15:desc function upper;

hive (rachel_db_hive)> desc function upper;
OK
tab_name
upper(str) - Returns str with all characters changed to uppercase
Time taken: 0.1 seconds, Fetched: 1 row(s)
hive (rachel_db_hive)> desc function extended upper;
OK
tab_name
upper(str) - Returns str with all characters changed to uppercase
Synonyms: ucase
Example:
  > SELECT upper('Facebook') FROM src LIMIT 1;
  'FACEBOOK'
Time taken: 0.076 seconds, Fetched: 5 row(s)

16:select userid,upper(username) upper_name from student;
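
This final example applies the built-in upper() UDF to a column and gives the result the alias upper_name. A sketch; it assumes a student table with userid and username columns, which is not created anywhere in this post:

-- hypothetical table: student(userid INT, username STRING)
select userid, upper(username) upper_name from student;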
