一、前言
hive UDF
二、Hive 常用的函数
2.1、关系操作符
|
Operator |
Operand types |
Description |
|
A = B |
All primitive types |
TRUE if expression A is equal to expression B otherwise FALSE |
|
A == B |
None! |
Fails because of invalid syntax. SQL uses =, not == |
|
A <> B |
All primitive types |
NULL if A or B is NULL, TRUE if expression A is NOT equal to expression B otherwise FALSE |
|
A < B |
All primitive types |
NULL if A or B is NULL, TRUE if expression A is less than expression B otherwise FALSE |
|
A <= B |
All primitive types |
NULL if A or B is NULL, TRUE if expression A is less than or equal to expression B otherwise FALSE |
|
A > B |
All primitive types |
NULL if A or B is NULL, TRUE if expression A is greater than expression B otherwise FALSE |
|
A >= B |
All primitive types |
NULL if A or B is NULL, TRUE if expression A is greater than or equal to expression B otherwise FALSE |
|
A IS NULL |
all types |
TRUE if expression A evaluates to NULL otherwise FALSE |
|
A IS NOT NULL |
All types |
TRUE if expression A evaluates to NULL otherwise FALSE |
|
A LIKE B |
strings |
NULL if A or B is NULL, TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A(similar to . in posix regular expressions) while the % character in B matches an arbitrary number of characters in A(similar to .* in posix regular expressions) e.g. 'foobar' like 'foo' evaluates to FALSE where as 'foobar' like 'foo_ _ _' evaluates to TRUE and so does 'foobar' like 'foo%' |
|
A RLIKE B |
strings |
NULL if A or B is NULL, TRUE if string A matches the Java regular expression B(See Java regular expressions syntax), otherwise FALSE e.g. 'foobar' rlike 'foo' evaluates to FALSE where as 'foobar' rlike '^f.*r$' evaluates to TRUE |
|
A REGEXP B |
strings |
Same as RLIKE |
2.2、代数操作符
|
Operator |
Operand types |
Description |
|
A + B |
All number types |
Gives the result of adding A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. e.g. since every integer is a float, therefore float is a containing type of integer so the + operator on a float and an int will result in a float. |
|
A - B |
All number types |
Gives the result of subtracting B from A. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
|
A * B |
All number types |
Gives the result of multiplying A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. Note that if the multiplication causing overflow, you will have to cast one of the operators to a type higher in the type hierarchy. |
|
A / B |
All number types |
Gives the result of dividing B from A. The result is a double type. |
|
A % B |
All number types |
Gives the reminder resulting from dividing A by B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
|
A & B |
All number types |
Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
|
A | B |
All number types |
Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
|
A ^ B |
All number types |
Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
|
~A |
All number types |
Gives the result of bitwise NOT of A. The type of the result is the same as the type of A. |
2.3、复杂类型操作符
|
Constructor Function |
Operands |
Description |
|
Map |
(key1, value1, key2, value2, ...) |
Creates a map with the given key/value pairs |
|
Struct |
(val1, val2, val3, ...) |
Creates a struct with the given field values. Struct field names will be col1, col2, ... |
|
Array |
(val1, val2, ...) |
Creates an array with the given elements |
2.4、日期函数
|
返回值类型 |
名称 |
描述 |
|
string |
from_unixtime(int unixtime) |
将时间戳(unix epoch秒数)转换为日期时间字符串,例如from_unixtime(0)="1970-01-01 00:00:00" |
|
bigint |
unix_timestamp() |
获得当前时间戳 |
|
bigint |
unix_timestamp(string date) |
获得date表示的时间戳 |
|
bigint |
to_date(string timestamp) |
返回日期字符串,例如to_date("1970-01-01 00:00:00") = "1970-01-01" |
|
string |
year(string date) |
返回年,例如year("1970-01-01 00:00:00") = 1970,year("1970-01-01") = 1970 |
|
int |
month(string date) |
|
|
int |
day(string date) dayofmonth(date) |
|
|
int |
hour(string date) |
|
|
int |
minute(string date) |
|
|
int |
second(string date) |
|
|
int |
weekofyear(string date) |
|
|
int |
datediff(string enddate, string startdate) |
返回enddate和startdate的天数的差,例如datediff('2009-03-01', '2009-02-27') = 2 |
|
int |
date_add(string startdate, int days) |
加days天数到startdate: date_add('2008-12-31', 1) = '2009-01-01' |
|
int |
date_sub(string startdate, int days) |
减days天数到startdate: date_sub('2008-12-31', 1) = '2008-12-30' |
2.5、条件函数
|
返回值类型 |
名称 |
描述 |
|
- |
if(boolean testCondition, T valueTrue, T valueFalseOrNull) |
当testCondition为真时返回valueTrue,testCondition为假或NULL时返回valueFalseOrNull |
|
- |
COALESCE(T v1, T v2, ...) |
返回列表中的第一个非空元素,如果列表元素都为空则返回NULL |
|
- |
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END |
a = b,返回c;a = d,返回e;否则返回f |
|
- |
CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END |
a 为真,返回b;c为真,返回d;否则e |
2.6、字符串函数
|
返回值类型 |
名称 |
描述 |
|
Int |
length(string A) |
返回字符串长度 |
|
String |
reverse(string A) |
反转字符串 |
|
String |
concat(string A, string B...) |
合并字符串,例如concat('foo', 'bar')='foobar'。注意这一函数可以接受任意个数的参数 |
|
String |
substr(string A, int start) substring(string A, int start) |
返回子串,例如substr('foobar', 4)='bar' |
|
String |
substr(string A, int start, int len) substring(string A, int start, int len) |
返回限定长度的子串,例如substr('foobar', 4, 1)='b' |
|
String |
upper(string A) ucase(string A) |
转换为大写 |
|
String |
lower(string A) lcase(string A) |
转换为小写 |
|
String |
trim(string A) |
|
|
String |
ltrim(string A) |
|
|
String |
rtrim(string A) |
|
|
String |
regexp_replace(string A, string B, string C) |
Returns the string resulting from replacing all substrings in B that match the Java regular expression syntax(See Java regular expressions syntax) with C e.g. regexp_replace("foobar", "oo|ar", "") returns 'fb.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. |
|
String |
regexp_extract(string subject, string pattern, int intex) |
返回使用正则表达式提取的子字串。例如,regexp_extract('foothebar', 'foo(.*?)(bar)', 2)='bar'。注意使用特殊字符的规则:使用'\s'代表的是字符's';空白字符需要使用'\\s',以此类推。 |
|
String |
parse_url(string urlString, string partToExtract) |
解析URL字符串,partToExtract的可选项有:HOST, PATH, QUERY, REF, PROTOCOL, FILE, AUTHORITY, USERINFO。 |
|
例如, | ||
|
parse_url('http://facebook.com/path/p1.php?query=1', 'HOST')='facebook.com' | ||
|
parse_url('http://facebook.com/path/p1.php?query=1', 'PATH')='/path/p1.php' | ||
|
parse_url('http://facebook.com/path/p1.php?query=1', 'QUERY')='query=1',可以指定key来返回特定参数,key的格式是QUERY:<KEY_NAME>,例如QUERY:k1 | ||
|
parse_url('http://facebook.com/path/p1.php?query=1&field=2','QUERY','query')='1'可以用来取出外部渲染参数key对应的value值 | ||
|
parse_url('http://facebook.com/path/p1.php?query=1&field=2','QUERY','field')='2' | ||
|
parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'REF')='Ref' | ||
|
parse_url('http://facebook.com/path/p1.php?query=1#Ref', 'PROTOCOL')='http' | ||
|
String |
get_json_object(string json_string, string path) |
解析json字符串。若源json字符串非法则返回NULL。path参数支持JSONPath的一个子集,包括以下标记: |
|
$: Root object | ||
|
[]: Subscript operator for array | ||
|
&: Wildcard for [] | ||
|
.: Child operator | ||
|
String |
space(int n) |
返回一个包含n个空格的字符串 |
|
String |
repeat(string str, int n) |
重复str字符串n遍 |
|
String |
ascii(string str) |
返回str中第一个字符的ascii码 |
|
String |
lpad(string str, int len, string pad) |
左端补齐str到长度为len。补齐的字符串由pad指定。 |
|
String |
rpad(string str, int len, string pad) |
右端补齐str到长度为len。补齐的字符串由pad指定。 |
|
Array |
split(string str, string pat) |
返回使用pat作为正则表达式分割str字符串的列表。例如,split('foobar', 'o')[2] = 'bar'。 |
|
Int |
find_in_set(string str, string strList) |
Returns the first occurance of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. e.g. find_in_set('ab', 'abc,b,ab,c,def') returns 3 |
set mapred.reduce.tasks=100;
set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/ range_key_list;
Select sale_id, amount from t_order
Cluster by sale_id
Sort by amount;生成区间文件,例如使用hadoop 自带的org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSede
例如有一个按id有序的t_sale表:
CREATE TABLE if not exists t_sale (
id int,
name string,
loc string
);
则生成按sale_id分发的区间文件的方法是:
create external table range_keys(sale_id int)
row format serde
'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
stored as
inputformat
'org.apache.hadoop.mapred.TextInputFormat'
outputformat
'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
location '/tmp/range_key_list';
insert overwrite table range_keys
select distinct sale_id
from source t_sale sampletable(BUCKET 100 OUT OF 100 ON rand()) s
sort by sale_id;3.2、怎样写exist/in子句SELECT a.key, a.value
FROM a
WHERE a.key in
(SELECT b.key
FROM B);
可以改写为
SELECT a.key, a.value
FROM a LEFT OUTER JOIN b ON (a.key = b.key)
WHERE b.key <> NULL;
SELECT a.key, a.value
FROM a
WHERE a.key in
(SELECT b.key
FROM B);
可以改写为
SELECT a.key, a.value
FROM a LEFT OUTER JOIN b ON (a.key = b.key)
WHERE b.key <> NULL;
本文是Hadoop自学系列的第八篇,聚焦于Hive的常用函数,包括关系操作符、代数操作符、复杂类型操作符、日期函数、条件函数和字符串函数。深入探讨了如何将子查询转换为LEFT OUTER JOIN,以优化查询性能。
3090

被折叠的 条评论
为什么被折叠?



