count(distinct colA)与group by

本文深入探讨了SQL中GROUP BY与COUNT(DISTINCT)的区别,解析了这两种方法在处理海量数据时的不同策略及优劣,帮助读者理解如何在时间和空间效率之间做出选择。

在传统关系型数据库中,group by与count(distinct)都是很常见的操作。count(distinct colA)就是将colA中所有出现过的不同值取出来,相信只要接触过数据库的同学都能明白什么意思。

count(distinct colA)的操作也可以用group by的方式完成,具体代码如下:

select count(distinct colA) from table1;
select count(1) from (select colA from table1 group by colA)alias_1;

这两者最后得出的结果是一致的,但是具体的实现方式,有什么不同呢?
上面两种方式本质就是时间与空间的权衡。
distinct需要将colA中的所有内容都加载到内存中,大致可以理解为一个hash结构,key自然就是colA的所有值。因为是hash结构,那运算速度自然就快。最后计算hash中有多少key就是最终的结果。
那么问题来了,在现在的海量数据环境下,需要将所有不同的值都存起来,这个内存消耗,是可想而知的。所以如果数据量特别大,可能会out of memory。。。
group by的实现方式是先将colA排序。排序大家都不陌生,拿最见得快排来说,时间复杂度为O(nlogn),而空间复杂度只有O(1)。这样一来,即使数据量再大一些,group by基本也能hold住。但是因为需要做一次O(nlogn) 的排序,时间自然会稍微慢点。。。
总结起来就是,count(distinct)吃内存,查询快;group by空间复杂度小,在时间复杂度允许的情况下,可以发挥他的空间复杂度优势。

转载侵删!!!

select t1.stats_date_period ,t1.loan_account_id ,t1.user_id ,t1.active_os_list ,t1.active_platform_name_list ,t1.active_sdk_type_list ,t1.active_device_id_list ,t1.latest_active_os ,t1.latest_active_platform_name ,t1.latest_active_sdk_type ,t1.latest_active_device_id ,t1.latest_active_city ,t1.latest_active_province ,t1.latest_active_country ,t1.login_cnt ,t1.repayment_plan_exposure_cnt ,t1.living_confirmed_photo_click_cnt ,t1.app_launch_cnt ,t1.credit_increase_click_cnt ,t1.contact_from_contacts_cnt ,t1.app_exit_cnt ,t1.coupon_pageview_cnt ,t1.change_bank_card_click_cnt ,t1.delete_account_succ_cnt ,t2.active_os_cnt ,t2.active_platform_name_cnt ,t2.active_sdk_type_list ,t2.active_ip_cnt ,t2.active_device_cnt ,t2.active_city_cnt ,t2.active_province_cnt ,t2.active_country_cnt ,t3.login_method_distribution ,t4.app_launch_hour_distribution ,t5.app_consecutive_launch_days ,t6.bankcard_valid_cnt ,t6.bankcard_invalid_cnt from ( select stats_date_period ,loan_account_id ,user_id ,active_os_list ,active_platform_name_list ,active_sdk_type_list ,active_device_id_list ,count(case when event_code = 'login' then 1 end) as login_cnt ,count(case when event_code = 'repayment_plan_exposure' then 1 end) as repayment_plan_exposure_cnt ,count(case when event_code = 'living_confirmed_photo_click' then 1 end) as living_confirmed_photo_click_cnt ,count(case when event_code = 'app_start' then 1 end) as app_launch_cnt ,count(case when event_code = 'credit_increase_click' then 1 end) as credit_increase_click_cnt ,count(case when event_code = 'contact_from_contacts' then 1 end) as contact_from_contacts_cnt ,count(case when event_code = 'app_end' then 1 end) as app_exit_cnt ,count(case when event_code = 'coupon_pageview' then 1 end) as coupon_pageview_cnt ,count(case when event_code = 'change_bank_card_click' then 1 end) as change_bank_card_click_cnt ,count(case when event_code = 'delete_account_succ' then 1 end) as delete_account_succ_cnt ,max(latest_active_os) as latest_active_os ,max(latest_active_platform_name) as latest_active_platform_name ,max(latest_active_sdk_type) as latest_active_sdk_type ,max(latest_active_device_id) as latest_active_device_id ,max(latest_active_city) as latest_active_city ,max(latest_active_province) as latest_active_province ,max(latest_active_country) as latest_active_country from ( select stats_date_period ,loan_account_id ,user_id ,active_os_list ,active_platform_name_list ,active_sdk_type_list ,active_device_id_list ,event_code ,ip ,created_ts ,created_hour ,city ,province ,country ,event_code ,login_type ,last_value(os) over(partition by user_id order by created_ts asc IGNORE NULLS) as latest_active_os -- 过滤空值 ,last_value(active_platform_name_list) over(partition by user_id order by created_ts asc IGNORE NULLS) as latest_active_platform_name ,last_value(active_sdk_type_list) over(partition by user_id order by created_ts asc IGNORE NULLS) as latest_active_sdk_type ,last_value(active_device_id_list) over(partition by user_id order by created_ts asc IGNORE NULLS) as latest_active_device_id ,last_value(city) over(partition by user_id order by created_ts asc IGNORE NULLS) as latest_active_city ,last_value(province) over(partition by user_id order by created_ts asc IGNORE NULLS) as latest_active_province ,last_value(country) over(partition by user_id order by created_ts asc IGNORE NULLS) as latest_active_country from user_detail ) a group by stats_date_period ,loan_account_id ,user_id ,active_os_list ,active_platform_name_list ,active_sdk_type_list ,active_device_id_list ) t1 left join ( select stats_date_period ,loan_account_id ,user_id ,count(distinct active_os_list) as active_os_cnt ,count(distinct active_platform_name_list) as active_platform_name_cnt ,count(distinct active_sdk_type_list) as active_sdk_type_list ,count(distinct ip) as active_ip_cnt ,count(distinct active_device_id_list) as active_device_cnt ,count(distinct case when city is not null then city end) as active_city_cnt ,count(distinct case when province is not null then province end) as active_province_cnt ,count(distinct case when country is not null then country end) as active_country_cnt from user_detail group by stats_date_period ,loan_account_id ,user_id ) t2 on t1.stats_date_period = t2.stats_date_period and t1.user_id = t2.user_id and t1.loan_account_id = t2.loan_account_id left join -- 登录方式分布 ( select stats_date_period ,loan_account_id ,user_id ,concat(login_type, '', cast(login_type_cnt as string)) as login_method_distribution from ( select stats_date_period ,loan_account_id ,user_id ,login_type ,count(1) as login_type_cnt from user_detail where login_type is not null group by stats_date_period ,loan_account_id ,user_id ,login_type ) a ) t3 on t1.stats_date_period = t3.stats_date_period and t1.user_id = t3.user_id and t1.loan_account_id = t3.loan_account_id left join -- APP启动时间段分布 ( select stats_date_period ,loan_account_id ,user_id ,concat(created_hour, '', cast(login_hour_cnt as string)) as app_launch_hour_distribution from ( select stats_date_period ,loan_account_id ,user_id ,created_hour ,count(1) as login_hour_cnt from user_detail where event_code = 'app_start' group by stats_date_period ,loan_account_id ,user_id ,created_hour ) a ) t4 on t1.stats_date_period = t4.stats_date_period and t1.user_id = t4.user_id and t1.loan_account_id = t4.loan_account_id left join -- 用户最后一次连续登陆天数 ( select user_id ,loan_account_id ,stats_date_period ,consecutive_days as app_consecutive_launch_days from ( select user_id ,loan_account_id ,stats_date_period ,min(dt) as session_start_date -- 连续登陆起始日 ,max(dt) as session_end_date -- 连续会话结束日 ,count(distinct dt) as consecutive_days -- 连续登陆天数 ,max(last_login_dt) as last_login_dt -- 用户在该区间最后一次登录日期 from ( select dt ,stats_date_period ,user_id ,loan_account_id ,cast(dt as int) - row_number() over (partition by user_id, b_stats_date_period order by dt) as session_start -- 连续登陆标识 ,max(dt) over (partition by user_id, b_stats_date_period) as last_login_dt -- 用户在该区间最后一次登录日期 from ( select dt ,user_id ,loan_account_id ,stats_date_period from user_detail where event_code = 'app_start' group by dt ,user_id ,loan_account_id ,stats_date_period ) a ) b group by user_id ,loan_account_id ,stats_date_period ) c where last_login_dt = session_end_date -- 限定最后登陆日期=某次连续登陆最后日期,即可取得最后一次连续登陆的天数(含1天) ) t5 on t1.stats_date_period = t5.stats_date_period and t1.user_id = t5.user_id and t1.loan_account_id = t5.loan_account_id left join -- 银行卡不同状态绑定数 ( select b_stats_date_period as stats_date_period ,user_id ,count(distinct if(available_status = 'A', account_number, null)) as bankcard_valid_cnt ,count(distinct if(available_status = 'U', account_number, null)) as bankcard_invalid_cnt from ( select case when dt = '20251204' then array('T1','T3','T7','T30') when dt between '20251202' and '20251204' then array('T3','T7','T30') when dt between '20251128' and '20251204' then array('T7','T30') when dt between '20251105' and '20251204' then array('T30') end as stats_date_period ,user_id ,account_number ,available_status from ( select dt ,user_id ,account_number ,available_status ,row_number() over (partition by account_number order by updated_ts desc) as rn from ec_dim.dim_ec_df_loan_bank_account -- 只有最近六天 where dt between '20251105' and '20251204' ) a where rn = 1 -- 取最新状态 ) b LATERAL VIEW explode(stats_date_period) b AS b_stats_date_period group by b_stats_date_period ,user_id ) t6 on t1.stats_date_period = t6.stats_date_period and t1.user_id = t6.user_id 哪里缺少右括号了。帮我看一下
12-06
SET @prev_n1 = ''; SET @prev_unitpri = NULL; SET @group_id = 0; SET @rownum = 0; SELECT * FROM ( SELECT no3, date3, vendorid3, vendor, model3, sort3, color3, size3, unitpri3, N1, is_blank, group_id FROM ( -- 实际数据行 SELECT tb1.*, 0 AS is_blank, @group_id := IF(@prev_n1 = N1 AND @prev_unitpri = unitpri3, @group_id, @group_id + 1) AS group_id, @prev_n1 := N1, @prev_unitpri := unitpri3, @rownum := @rownum + 1 AS rn FROM ( -- 原始查询 SELECT * FROM ( SELECT no3,date3,vendorid3,vendor,model3,sort3,color3,size3,unitpri3, CONCAT(vendorid3,vendor,model3,sort3,color3,size3) AS N1 FROM ( SELECT distinct subptin1.invono no3,subptin1.date date3,subptin1.vendorid vendorid3, vendor,subptin1.createid,subptin1.lconfirma,subptou2.model model3,subptou2.platsort sort3, subptou2.color color3,subptou2.size size3,subptou2.platsort,subptou2.cola,subptin2.qty, subptin2.weight,subptin2.unitpri unitpri3,round(subptin2.qty*subptin2.unitpri,2) amt, subptin1.prncount,subptin2.mem,subptin1.prndatetime,subptin1.editordt,subptin1.confirmadt,subptin1.createdt FROM subptin1 INNER JOIN subptin2 ON subptin2.invono = subptin1.invono INNER JOIN subptou2 ON subptou2.mid = subptin2.id INNER JOIN subptou1 ON subptou1.invono = subptou2.invono LEFT JOIN subvendor1 ON subvendor1.vendorid = subptin1.vendorid WHERE DATE_FORMAT(subptin1.date,'%Y-%m') >= DATE_FORMAT(DATE_SUB(CURDATE(), INTERVAL 1 MONTH),'%Y-%m') AND subptin2.mem NOT LIKE '%退%' AND subptin2.mem NOT LIKE '%原%' AND subptin2.mem NOT LIKE '%不%' AND subptin2.mem NOT LIKE '%X%' ) tb0 ) tb1 WHERE N1 IN ( SELECT N1 FROM ( SELECT COUNT(1), N1 FROM ( SELECT distinct vendorid3,vendor,model3,sort3,color3,size3,unitpri3, CONCAT(vendorid3,vendor,model3,sort3,color3,size3) AS N1 FROM ( SELECT subptin1.invono,subptin1.date,subptin1.vendorid vendorid3,vendor, subptin1.createid,subptin1.lconfirma,subptou2.model model3,subptou2.platsort sort3, subptou2.color color3,subptou2.size size3,subptou2.platsort,subptou2.cola,subptin2.qty, subptin2.weight,subptin2.unitpri unitpri3,round(subptin2.qty*subptin2.unitpri,2) amt, subptin1.prncount,subptin2.mem,subptin1.prndatetime,subptin1.editordt,subptin1.confirmadt,subptin1.createdt FROM subptin1 INNER JOIN subptin2 ON subptin2.invono = subptin1.invono INNER JOIN subptou2 ON subptou2.mid = subptin2.id INNER JOIN subptou1 ON subptou1.invono = subptou2.invono LEFT JOIN subvendor1 ON subvendor1.vendorid = subptin1.vendorid WHERE DATE_FORMAT(subptin1.date,'%Y-%m') >= DATE_FORMAT(DATE_SUB(CURDATE(), INTERVAL 1 MONTH),'%Y-%m') AND subptin2.mem NOT LIKE '%退%' AND subptin2.mem NOT LIKE '%原%' AND subptin2.mem NOT LIKE '%不%' AND subptin2.mem NOT LIKE '%X%' ) tb1 ) tb2 GROUP BY N1 HAVING COUNT(1) > 1 ) tb3 ) ORDER BY vendorid3,vendor,model3,sort3,color3,size3,unitpri3 ) tb1 ) actual_data UNION ALL -- 插入空白行 SELECT NULL AS no3, NULL AS date3, NULL AS vendorid3, NULL AS vendor, NULL AS model3, NULL AS sort3, NULL AS color3, NULL AS size3, NULL AS unitpri3, NULL AS N1, 1 AS is_blank, group_id FROM ( SELECT group_id FROM ( SELECT group_id, MAX(rn) AS last_in_group FROM ( -- 同上实际数据查询... -- 此处重复实际数据查询部分,用于获取组信息 ) grouped_data GROUP BY group_id ) group_endings ) blank_rows ) final_result ORDER BY CASE WHEN is_blank = 0 THEN rn ELSE last_in_group + 0.5 END, is_blank; 这段代码,请按照我提供的源码,全部发完整的
08-21
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值