Overview
Hive maps structured data stored on HDFS into tables,
and the data is then processed through those Hive tables.
Architecture

1: User interfaces (client)
- CLI (hive shell)
- JDBC (access Hive from Java)
- WEBUI (access Hive from a browser)
2: Metadata (Metastore)
- table name
- database the table belongs to
- table owner
- column names
- partition column names
- table type (external or internal)
- HDFS directory where the table data lives
You can inspect the metastore tables directly in MySQL:

[rachel@bigdata-senior02 ~]$ mysql -u root -p
Enter password:
mysql> select * from SDS;

| SD_ID | CD_ID | INPUT_FORMAT | IS_COMPRESSED | IS_STOREDASSUBDIRECTORIES | LOCATION | NUM_BUCKETS | OUTPUT_FORMAT | SERDE_ID |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | org.apache.hadoop.mapred.TextInputFormat | 0 | 0 | hdfs://ns/user/hive/warehouse/test | -1 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | 1 |
| 6 | 6 | org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat | 0 | 0 | hdfs://ns/user/hive/warehouse/weblogs | -1 | org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat | 6 |
| 11 | 11 | org.apache.hadoop.mapred.TextInputFormat | 0 | 0 | hdfs://ns/user/hive/warehouse/rachel.db/test | -1 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | 11 |

SD_ID and CD_ID are the table's IDs.
- INPUT_FORMAT: since the underlying engine is MapReduce, the value is org.apache.hadoop.mapred.TextInputFormat
- IS_COMPRESSED: whether the data is compressed
- LOCATION: the HDFS directory where the table data lives
For the table with SD_ID 11, the corresponding table can be found in Hive:

hive (rachel)> show tables;
OK
tab_name
test
Time taken: 0.062 seconds, Fetched: 1 row(s)

The TBLS table stores the table owner, table type, and table name:
SELECT * from TBLS;

| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1537004065 | 1 | 0 | rachel | 0 | 1 | test | MANAGED_TABLE | NULL | NULL |
| 6 | 1537028958 | 1 | 0 | rachel | 0 | 6 | weblogs | EXTERNAL_TABLE | NULL | NULL |
| 11 | 1537226416 | 6 | 0 | rachel | 0 | 11 | test | MANAGED_TABLE | NULL | NULL |

Note: by default the metadata is stored in Hive's embedded Derby database; using MySQL to store the metadata is recommended.
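Since TBLS references SDS through SD_ID, joining the two maps each table to its HDFS location. A minimal sketch against the MySQL metastore schema shown above:

```sql
-- Run in the MySQL database that backs the metastore:
-- map every Hive table to its type and HDFS location
SELECT t.TBL_NAME, t.TBL_TYPE, s.LOCATION
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID;
```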
3: Hadoop
Uses HDFS for data storage and MapReduce for computation.
4: Driver
Contains: parser, optimizer, compiler, and executor.
1: The Driver receives the Hive statement from the client.
2: The SQL parser validates the statement's syntax.
It also checks table and column names; with GROUP BY, for example, some columns may legally appear in the select list and some may not.
3: The statement is optimized based on its SQL semantics.
4: The SQL is compiled into an executable MapReduce program (see the EXPLAIN sketch below).
5: The MR program is executed.
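To inspect what the Driver compiles a statement into without running anything, EXPLAIN prints the stage plan (u_data is the example table created later in this post):

```sql
-- Print the execution plan, including any MapReduce stages,
-- without actually executing the query
EXPLAIN select count(1) from u_data;
```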
Hive's advantages and use cases
1: Simple and easy to use (the interface uses SQL-like syntax, enabling rapid development).
2: Avoids hand-writing MapReduce, lowering the learning cost for developers.
3: Unified metadata management; data can be shared with Impala, Spark, and others.
4: Easy to extend.
5: Offline data processing, e.g. log analysis and offline analysis of massive structured data.
6: Hive has high execution latency, so it is commonly used for offline analysis where real-time response is not required.
7: Hive's strength is processing big data; it offers no advantage on small data, because of its high execution latency.
Common Hive commands
DML operates on data:
DML (load/insert/update/delete/merge, import/export, explain plan)
DDL operates on databases and tables:
DDL (create/drop/alter/truncate/show/describe), Statistics (analyze), Indexes, Archiving
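For instance, creating a database is DDL while loading a file into a table is DML; both commands appear later in this post:

```sql
-- DDL: defines a schema object
create database rachel_db_hive;
-- DML: operates on the data inside a table
load data local inpath '/opt/datas/u.data' into table u_data;
```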
Start the metastore service on its own:
[rachel@bigdata-senior02 hive-0.13.1-bin]$ bin/hive --service metastore
Starting Hive Metastore Server
1:show databases;
hive (rachel)> show databases;
OK
database_name
default
rachel
Time taken: 0.081 seconds, Fetched: 2 row(s)
database_name here is not a table; it is just the column header.
Both behaviors are controlled in hive-site.xml: one setting shows the current database in the prompt, the other prints column headers:
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
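The same options can also be set per session from the CLI instead of editing hive-site.xml:

```sql
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
```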
2:create database rachel_db_hive;
hive (default)> create database rachel_db_hive;
OK
Time taken: 10.458 seconds
3:drop database db_hive;
4:use rachel_db_hive;
5: create a table
CREATE TABLE u_data(
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
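The table expects tab-separated input. u.data is the MovieLens ratings file; its rows look like the following (illustrative sample; fields are userid, movieid, rating, and unixtime, separated by tabs):

```
196	242	3	881250949
186	302	3	891717742
```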
6:show tables
hive (rachel_db_hive)> show tables;
OK
tab_name
u_data
Time taken: 0.041 seconds, Fetched: 1 row(s)
7:select
hive (rachel_db_hive)> select * from u_data;
OK
u_data.userid u_data.movieid u_data.rating u_data.unixtime
8: load data
hive (rachel_db_hive)> load data local inpath '/opt/datas/u.data' into table u_data;
Copying data from file:/opt/datas/u.data
Copying file: file:/opt/datas/u.data
Loading data to table rachel_db_hive.u_data
Table rachel_db_hive.u_data stats: [numFiles=1, numRows=0, totalSize=1979173, rawDataSize=0]
OK
Time taken: 13.308 seconds
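Without the LOCAL keyword the path is read from HDFS instead (and the file is moved into the warehouse rather than copied); adding OVERWRITE replaces any existing data. A sketch with a hypothetical HDFS path:

```sql
-- move the HDFS file into the table, replacing existing data
load data inpath '/user/rachel/u.data' overwrite into table u_data;
```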
9: select differences across Hive versions
Before Hive 1.0, selecting a single column ran a MapReduce job;
since Hive 1.0, selecting a column no longer executes MapReduce.
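This behavior is related to Hive's fetch-task conversion setting (the exact default varies by version), which lets simple queries be answered by reading HDFS directly:

```sql
-- 'more' converts simple select/filter/limit queries into direct
-- HDFS fetches instead of launching MapReduce jobs
set hive.fetch.task.conversion=more;
select userid from u_data limit 10;
```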
10: custom aliases
An aggregation such as count(1) still has to run a MapReduce job:
hive (rachel_db_hive)> select count(1) u_count from u_data;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1539517625007_0001, Tracking URL = http://bigdata-pro01.rachel.com:8088/proxy/application_1539517625007_0001/
Kill Command = /opt/modules/hadoop-2.5.0/bin/hadoop job -kill job_1539517625007_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-10-14 08:10:16,687 Stage-1 map = 0%, reduce = 0%
2018-10-14 08:11:17,397 Stage-1 map = 0%, reduce = 0%
2018-10-14 08:11:19,549 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.57 sec
2018-10-14 08:11:53,327 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.6 sec
MapReduce Total cumulative CPU time: 6 seconds 600 msec
Ended Job = job_1539517625007_0001
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 6.6 sec HDFS Read: 1979388 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 600 msec
OK
u_count
100000
Time taken: 165.921 seconds, Fetched: 1 row(s)
11:desc u_data;
col_name data_type comment
userid int
movieid int
rating int
unixtime string
Time taken: 0.763 seconds, Fetched: 4 row(s)
12:desc extended u_data;
hive (rachel_db_hive)> desc extended u_data;
OK
col_name data_type comment
userid int
movieid int
rating int
unixtime string
Detailed Table Information Table(tableName:u_data, dbName:rachel_db_hive, owner:rachel, createTime:1539516342, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:userid, type:int, comment:null), FieldSchema(name:movieid, type:int, comment:null), FieldSchema(name:rating, type:int, comment:null), FieldSchema(name:unixtime, type:string, comment:null)], location:hdfs://ns/user/hive/warehouse/rachel_db_hive.db/u_data, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format= , field.delim=
Time taken: 0.154 seconds, Fetched: 6 row(s)
13:desc formatted u_data;
Shows the table's detailed information: its structure, parameters, and storage details.
hive (rachel_db_hive)> desc formatted u_data;
OK
col_name data_type comment
# col_name data_type comment
userid int
movieid int
rating int
unixtime string
# Detailed Table Information
Database: rachel_db_hive
Owner: rachel
CreateTime: Sun Oct 14 07:25:42 EDT 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://ns/user/hive/warehouse/rachel_db_hive.db/u_data
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 1
numRows 0
rawDataSize 0
totalSize 1979173
transient_lastDdlTime 1539518082
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.174 seconds, Fetched: 35 row(s)
14:show functions
hive (rachel_db_hive)> show functions;
OK
tab_name
!
!=
%
&
*
+
-
/
<
<=
<=>
<>
=
==
>
>=
^
abs
acos
and
array
array_contains
ascii
asin
assert_true
atan
avg
base64
between
bin
case
ceil
ceiling
coalesce
collect_list
collect_set
compute_stats
concat
concat_ws
context_ngrams
conv
corr
cos
count
covar_pop
covar_samp
create_union
cume_dist
current_database
date_add
date_sub
datediff
day
dayofmonth
decode
degrees
dense_rank
div
e
elt
encode
ewah_bitmap
ewah_bitmap_and
ewah_bitmap_empty
ewah_bitmap_or
exp
explode
field
find_in_set
first_value
floor
format_number
from_unixtime
from_utc_timestamp
get_json_object
hash
hex
histogram_numeric
hour
if
in
in_file
index
inline
instr
isnotnull
isnull
java_method
json_tuple
lag
last_value
lcase
lead
length
like
ln
locate
log
log10
log2
lower
lpad
ltrim
map
map_keys
map_values
matchpath
max
min
minute
month
named_struct
negative
ngrams
noop
noopwithmap
not
ntile
nvl
or
parse_url
parse_url_tuple
percent_rank
percentile
percentile_approx
pi
pmod
posexplode
positive
pow
power
printf
radians
rand
rank
reflect
reflect2
regexp
regexp_extract
regexp_replace
repeat
reverse
rlike
round
row_number
rpad
rtrim
second
sentences
sign
sin
size
sort_array
space
split
sqrt
stack
std
stddev
stddev_pop
stddev_samp
str_to_map
struct
substr
substring
sum
tan
to_date
to_unix_timestamp
to_utc_timestamp
translate
trim
ucase
unbase64
unhex
unix_timestamp
upper
var_pop
var_samp
variance
weekofyear
when
windowingtablefunction
xpath
xpath_boolean
xpath_double
xpath_float
xpath_int
xpath_long
xpath_number
xpath_short
xpath_string
year
|
~
Time taken: 0.053 seconds, Fetched: 195 row(s)
15:desc function upper;
hive (rachel_db_hive)> desc function upper;
OK
tab_name
upper(str) - Returns str with all characters changed to uppercase
Time taken: 0.1 seconds, Fetched: 1 row(s)
hive (rachel_db_hive)> desc function extended upper;
OK
tab_name
upper(str) - Returns str with all characters changed to uppercase
Synonyms: ucase
Example:
> SELECT upper('Facebook') FROM src LIMIT 1;
'FACEBOOK'
Time taken: 0.076 seconds, Fetched: 5 row(s)
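Functions combine naturally with column aliases in a select list; for example (student is a hypothetical table with userid and username columns):

```sql
select userid, upper(username) upper_name from student;
```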