sampling采样

最新推荐文章于 2025-01-02 14:54:55 发布

隔壁老K~

最新推荐文章于 2025-01-02 14:54:55 发布

阅读量659

点赞数 13

文章标签： hive

本文链接：https://blog.youkuaiyun.com/weixin_48098867/article/details/138122550

版权

本文详细介绍了Hive中的数据采样技术，如tablesample函数，以及如何使用虚拟列处理数据和不同类型的函数（内置函数、自定义函数、UDF、UDAF、UDTF），包括字符串操作、时间处理、数学运算和条件判断，帮助开发者理解和应用这些技术在实际开发中提高效率。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

sampling采样

正常开发中我们会获取全量数据做开发使用，一般会从中获取一部分数据作为测试数据进行开发。

一般做测试数据的数据量是全部数据的1/10-几百分之一不等，但是一般也有几万条

如果我们自己提取测试数据一般使用sampling进行采样，数据随机性更高，数据质量更高，数据覆盖更全面。

格式：select.....from .....tablesample(bucket  x  out  of y on 字段名)；
create table  tb_bucket_9(
id int,
name string,
story_id int
)
clustered by (story_id) into 9 buckets --不带排序
clustered by (story_id) sorted by (story_id) into 9 buckets --带排序
select * from 表名 tablesample(bucket x out of y on 列)
y  决定要几桶，num = 表的桶数/y
x决定要几桶
select * from tb_bucket_9 tablesample (bucket 3 out of 3 on story_id);
--num=9/3 一共抽取三桶分别是第3，6，9桶
select * from tb_bucket_9  tablesample (bucket 1 out of 18 on story_id);
--num=9/18 一共抽取1/2桶，在第一桶中
--在开发中没我们不用纠结到底是怎样采样的，因为数据具有随机性，我们做测试数据的可靠性才高，我们要知道大概抽取的数据量即可
--注意：x要小于y，否则报错

virtual colums虚拟列

虚拟列：这个列，并不是真的存在的，而是元数据伪造的数据列

input_file_name ，显示数据行所在的具体文件

block_offset_inside_file 显示数据行所在文件的偏移量

row_offset_inside_block显示数据在HDFS块的偏移量

在开发中我们经常会根据文件名称分组获取数据，此时查询范围可以得到一定的缩小，其余场景虚拟列应用场景不多

函数的分类标准

函数部分内容很多，如果死记硬背难度极大，我们要做到的就是听懂，知道有这个功能，如果遇到了回过头来查询。

函数:就是程序员或者服务将一定的功能打包封装起来，用固定的字符组代替该功能，如果我们需要使用时，调用该字符组合名称，即可以使用该功能。
比如:max函数是获得最大值，我们没有亲自进行比较，而是系统帮我们把该功能封装好，即可使用该功能

hive中的函数分为：内置函数和自定义函数两种

内置函数：系统定义的函数，安装hive后自动获取

自定义函数：用户自己定义的函数

hive中的函数分类也可以安装数据输入内容分为：UDF UDAF UDTF三类

UDF：普通函数一进一出 round

UDAF：聚合函数多进一出 max min…

UDTF：表输出函数一进多出 explode

查看函数的使用方法

1、查看hive中有那些函数

show functions

2、查看函数或者运算符的使用方法

describe function extended 函数名

字符串函数

创建一个表，什么样的数据要使用字符串接收，什么样的数据要使用数值接受？

身份证号：字符串

学号：字符串

年龄：数值

身高：数值

价格：数值

在数据使用的过程中，需要参与计算的数据我们用数值型存储

在数据使用过程中，我们需要计数，分类（拆分或合并）的数据我们用字符串存储

--1、concat字符串拼接
select concat('hello','world');
--可以拼接多个字符串
select concat('菩提本无树','明镜亦非台','本来无一物','何处惹尘埃');
--2、concat_ws 字符串拼接（可以指定连接符号）
--查询使用方法
desc function extended concat_ws;
--官方实例
--select concat_ws('.','www',array('facebook','com')) from src limit 1;
--使用格式：concat_ws（分隔符，字符串|字符串数组）
select  concat(',','关羽','张飞');
select concat_ws(',','关羽','张飞','赵云','黄忠','马超');
--可与使用数组（数组内部的数据连接时会强行转换为字符串类型，转换失败则报错）
select concat_ws(',',`array`('西施','貂蝉','杨玉环'),'王昭君');
select concat_ws(',',`array`('西施','貂蝉','杨玉环'),'王昭君',false);
--布尔值类型不能用于连接
--3、length 计算字符串长度
select length('君不见黄河之水天上来');
--4、lower 将数据转换为小写，upper将数据转换为大写
select lower('ASDASZXCAS');
select upper('assdasddas');
--5、trim  裁剪字符串两次的空格
--仅去除左右两侧的空格，中间的空格不会被取消
select trim('  新月曲如眉，未有团圆意   ');
--6、split 按照指定的分割符拆分字符串
--格式：split（字符串，拆分分割符）
select split('怒发冲冠凭栏处，潇潇雨歇，抬望眼，仰天长啸，装还激烈',',');
--拆分后，该字符串变成了一个数组，将每一个字符串放置在数组内部
--7、substr  截取部分字符串，完全等价于substring
--格式：substr（要借去的字符串，截取位置，截取长度）
--特别注意：此处和其他编程语言截取不一样，是从1开始数的
select substr('英姿飒爽五尺枪，曙光初照练兵场，中华儿女多奇志，不爱红装爱武装',1,2);
select substr('英姿飒爽五尺枪，曙光初照练兵场，中华儿女多奇志，不爱红装爱武装',2,5);
--此处还可以使用负数索引，从右至左，从-1开始依次递减
select substr('英姿飒爽五尺枪，曙光初照练兵场，中华儿女多奇志，不爱红装爱武装',-2,5);
--8、parse_url url解析函数
desc function extended parse_url
--官方示例
--  > SELECT parse_url('http://
--  facebook.com  域名也叫主机地址  HOST
--  /path/p1.php 资源路径    PATH
--  ?query=1 参数 QUERY
select  parse_url('http://facebook.com/path/p1.php?query=1','HOST');--获取域名
select  parse_url('http://facebook.com/path/p1.php?query=1','PROTOCOL');--获取协议
select  parse_url('http://facebook.com/path/p1.php?query=1','PATH');--获取资源路径
select  parse_url('http://facebook.com/path/p1.php?query=1','QUERY');--获取参数
--9、get_json_object 获取json解析函数
--json 函数我们可以先理解为map类型和array类型的互相嵌套
--格式：get_Json_object(json数据，数据提取规则)
desc function extended get_json_object;
select get_json_object('{"name":"杨过","age":"18"}','$.name');
select get_json_object('[{"name":"杨过","age":"18"},{"name":"小龙女","age":"26"}]','$.[1].name');
--小龙女，  $表示json对象

时间函数

--时间函数
--1、获取当前时间
select `current_date`();
--2、获取当前日期时间
select `current_timestamp`();
--3、to_date将数据转换为日期类型
select to_date(`current_timestamp`());--将时间类型转换为日期类型
select to_date('2022-12-13 12:22:24');--将日期时间类型字符串转换为日期
select to_date('2022-11-23 12:');--将不完整的日期时间字符串转换为日期
select  to_date('2022-12-'); -- 如果日期不完整，则转换为null
--获取指定时间的部分信息
select year(`current_timestamp`());
select month(`current_timestamp`());
select hour(`current_timestamp`());
select weekofyear(`current_timestamp`()); --获取一年中的第几周
select `dayofweek`(`current_timestamp`()); --获取一周中的第几天，hive中是从周日开始算第一天的
--4、datediff 获取两个时间的时间差
--计算规则：前面的日期减去后面的日期
select datediff('2022-12-13','2022-12-11');
select datediff('2023-12-23 23:34:21','2023-12-11 21:23:34');
--5、时间便宜
--data_add 增加时间，时间向后偏移
select date_add('2022-12-11',12);
--data_sub 减少时间，时间向前偏移
select date_sub('2022-12-11',3);
--其实只需要记住date_add就可以，如果需要向前便宜时间，就是用负数的偏移量就可以了
--6、将时间类型转换为时间戳
--将事件类型转换
select unix_timestamp(`current_timestamp`());
--将时间格式字符串转换
select unix_timestamp('2022-12-11 08:12:22');
--如果时间字符串，不是标准时间格式，将返回null
select unix_timestamp('2022年12月11日 09时12分22秒');
--如果想转换非标准时间需要指定该字符串的时间格式
select unix_timestamp('2022年12月11日 09时12分22秒','yyy年mm月dd日 hh时mm分ss秒');
--7、将时间戳转换为时间类型
select from_unixtime(1324432312);
--转换为指定的格式时间类型字符串
select from_unixtime(123456,'yyy年mm月dd日  hh时mm分ss秒');
--8、date_format   将时间类型数据，或标准事件类型字符串，转换为指定格式的时间字符串
select date_format(`current_timestamp`(),'yyy年mm月dd日  hh时mm分ss秒');  --格式化日期数据
select date_format(`current_date`(),'yyy年mm月dd日  hh时mm分ss秒'); --格式化时间数据
select date_format('2022-12-11 08:12:33','yyy年mm月dd日  hh时mm分ss秒')  --格式化标准时间类型字符串

数学函数

--数学函数
--1、rand()    获取从0到1的随机数
--获取随机数，每次都不一样
select rand();
--获取随机数，每次获取值都相同，在括号内填写固定值，每次获取的随机数都相同，我们称这个数值为随机数种子
select rand(11);
--2、round  保留n位小鼠，并且四舍五入
select round(1.234);  --取整，并且四舍五入
select round(1.3456,2); --保留两位小数
--3、floor 向下取整
select `floor`(1.2345);----1
select `floor`(-1.123123);  --2
--4、ceil向上取整
select ceil(1.345);  --2
select ceil(-1.34423); --1
--思考：获取1-7的随机整数怎么取？
select round(rand()*6+1);
--思考：可以仅使用round完成向下取整 和 向上取整吗？
--向上取整
select round(1.023  + 0.5);
--向下取整
select round(1.023 - 0.5);

条件函数

--条件函数
--1、if条件函数
--格式：if（条件，条件成立时使用的数据，条件不成立时使用的数据）
select `if`(1>2,'成立','不成立');
select `if`(5>2,'成立','不成立');
--需求：查询students表中的数据内容genders字段为男，返回王子，如果为女返回公主
select name,`if`(gender = '男','王子','公主') from students;
--2、空值判断 is null is  not  null
--null  类型数据，无法进行比较判断和等值判断，返回的数据均为null
select null!=1;
select null=null;
select 3>null;
--如果想进行控制判断血药借助is 关键词
select 3 is null;
select  3 is not null;
select null is null;
select null is  not  null;
--3、nvl  传入一个数据，赋值一个默认值，如果该数据不为空，则返回该数据，如果该数据为空则返回默认值
select nvl(3,'默认值');
select nvl(null,'默认值');
--4、coalesce传入多个数据，返回第一个不为空的数据
select coalesce(12,null,14,'nihao');
select coalesce(null,4,5);
select coalesce(null,null,null)--null
--5、case when 多情况判断
/*
使用方式一：
case
when 条件1 then  条件1成立时返回的数据
when 条件2 then 条件2成立时返回的数据
when 条件3 then 条件3成立时返回的数据
........
else 所有条件均不成立时返回的数据
end 代表当前case when 语句结束
*/
--需求：需要将支付类型的编号修改为支付类型的名称0-现金支付 1-微信支付 2-支付宝支付 3-银行卡支付 4-未知 orders
select orderid,
       case
           when paytype = 0 then '现金支付'
           when paytype = 1 then '微信支付'
           when paytype = 2 then '支付宝支付'
           when paytype = 3 then '银行卡支付'
           when paytype = 4 then '未知'
           else '数据错误'
           else  as paytype
from orders;
/*
使用方式二：
case  表达式
when 表达式值1 then  表达式值为1时返回的数据
when 表达式值2 then  表达式值为2时返回的数据
when 表达式值3 then 表达式值为3时返回的数据
....
else 所有表达式的值均不成立时返回的数据
end 代表当前case when 语句结束
*/
--需求：需要将支付类型的编号修改为支付类型的名称0-现金支付 1-微信支付 2-支付宝支付 3-银行卡支付 4-未知 orders
select orderid,
       case paytype
           when 0 then'现金支付'
           when 1 then'微信支付'
           when 2 then '支付宝支付'
           when 3 then '银行卡支付'
           when 4 then '其他'
           else '数据错误' end as paytype
from orders;