1、需求:找到每个分组中pv最大的第一条数据,取10万条;
row_number 格式:ROW_NUMBER() OVER (partition BY COLUMN_A ORDER BY COLUMN_B ASC/DESC) rn
其中:
rn 是排序的别名执行时每组的编码从1开始
partition by:类似hive的建表,分区的意思;COLUMN_A 是分组字段
order by :排序,默认是升序,加desc降序;COLUMN_B 是排序字段
注解:其实这个排序具体理解如下:
1、按照sql中的group 操作之后会得到很多的分类;
2、这个分类不变,按照 partition by 后的字段COLUMN_A 作为ROW_NUMBER要分组字段,如果字段相同,则会产生1到N的行号;
3 、orderby 的作用就是按照那个字段值来排序,以此产生是从到大到小还是从小到大的排序;
本文查询用例:
select md5(stt.msisdn),stt.os_code from (select ro.msisdn,ro.imei,ro.pv,ro.os_code,ROW_NUMBER() OVER(PARTITION BY ro.msisdn ORDER BY ro.pv desc) AS rn from rong_getCustomers_data ro join (select msisdn,max(pv) pv from rong_getCustomers_data group by msisdn)st1 on ro.msisdn=st1.msisdn and ro.pv=st1.pv )stt where rn <=1 and length(stt.msisdn)=11 order by stt.pv desc limit 100000;
查询结果:(正确取到数据)
***********************************************************************************
[evercloud17:21000] > select md5(stt.msisdn),stt.os_code from (select ro.msisdn,ro.imei,ro.pv,ro.os_code,ROW_NUMBER() OVER(PARTITION BY ro.msisdn ORDER BY ro.pv desc) AS rn from rong_getCustomers_data ro join (select msisdn,max(pv) pv from rong_getCustomers_data group by msisdn)st1 on ro.msisdn=st1.msisdn and ro.pv=st1.pv )stt where rn <=1 and length(stt.msisdn)=11 order by stt.pv desc limit 10;
Query: select md5(stt.msisdn),stt.os_code from (select ro.msisdn,ro.imei,ro.pv,ro.os_code,ROW_NUMBER() OVER(PARTITION BY ro.msisdn ORDER BY ro.pv desc) AS rn from rong_getCustomers_data ro join (select msisdn,max(pv) pv from rong_getCustomers_data group by msisdn)st1 on ro.msisdn=st1.msisdn and ro.pv=st1.pv )stt where rn <=1 and length(stt.msisdn)=11 order by stt.pv desc limit 10
+----------------------------------+---------+
| default.md5(stt.msisdn) | os_code |
+----------------------------------+---------+
| f2f15925d59d38675612bcf70dc114e9 | 1 |
| 38b63fee0f74c0ffdf3572fd4a9d5e64 | 1 |
| 94b03cf77619a0e9e0309ad1c0a1a46b | 1 |
| a5ff2002635605e8554ee324f7051932 | 3 |
| d7a83461d3c12ebb7b1ed17eb5f64dd2 | 1 |
| f037d0028f8da6d83695ec6e54229db2 | 1 |
| d2c955d29caba4bba41f280f0aee88b7 | 1 |
| 448f4cc33a538a2c6c9fc003f34ca569 | 1 |
| 337c229108d579fea706b57b3212d690 | 1 |
| 39c2352f194d5b2cf077f8708d4a6665 | 1 |
+----------------------------------+---------+
Fetched 10 row(s) in 3.54s
2、测试其有聚合函数时候的写法和正确性;
2.1 聚合函数直接写:
select label,ecid,eid,findedname,ppv from (select ap.label,st.ecid,st.eid,ap.findedname,sum(pv) ppv,ROW_NUMBER() OVER(PARTITION BY ap.label ORDER BY sum(pv) desc) AS rn from statistic_day_app_user st join app_label_class ap on st.ecid=ap.ecid and st.eid=ap.eid where find_in_set(ap.label,'旅游出行,网上购物,金融理财')>0 and day>=20180806 and day<=20180906 group by ap.label,st.ecid,st.eid,ap.findedname)sts where rn <=50 ;
其中rownum 的order by 字段用聚合函数:就是用的分组后聚合---按照原来的分组聚合,不是按照COLUMN_A 分组聚合;
2.2 为了测试正确性--分开写,先统计后再用row_number;
select * from (select label,ecid,eid,findedname,ppv,ROW_NUMBER() OVER(PARTITION BY label ORDER BY ppv desc) AS rn from( select ap.label,st.ecid,st.eid,ap.findedname,sum(pv) ppv from statistic_day_app_user st join app_label_class ap on st.ecid=ap.ecid and st.eid=ap.eid where find_in_set(ap.label,'旅游出行,网上购物,金融理财')>0 and day>=20180806 and day<=20180906 group by ap.label,st.ecid,st.eid,ap.findedname)stt)sttt where rn <=50;
2.3、测试结果:
(1)、整体写的结果:
Query: select label,ecid,eid,findedname,ppv from (select ap.label,st.ecid,st.eid,ap.findedname,sum(pv) ppv,ROW_NUMBER() OVER(PARTITION BY ap.label ORDER BY sum(pv) desc) AS rn from statistic_day_app_user st join app_label_class ap on st.ecid=ap.ecid and st.eid=ap.eid where find_in_set(ap.label,'旅游出行,网上购物,金融理财')>0 and day>=20180806 and day<=20180906 group by ap.label,st.ecid,st.eid,ap.findedname)sts where rn <=50 ;
+----------+------+------+-----------------------------------------------+-------------+----+
| label | ecid | eid | findedname | ppv | rn |
+----------+------+------+-----------------------------------------------+-------------+----+
| 金融理财 | 9 | 3 | 支付宝 | 1274317627 | 1 |
| 金融理财 | 9 | 6 | 财付通 | 258357291 | 2 |
| 金融理财 | 21 | 8108 | 同花顺炒股票 | 242589336 | 3 |
| 金融理财 | 9 | 5 | 招商银行 | 203848183 | 4 |
| 金融理财 | 16 | 7128 | 平安普惠 | 178164508 | 5 |
| 金融理财 | 16 | 3 | 同花顺股票开户 | 142253306 | 6 |
| 金融理财 | 16 | 51 | 财富 | 131791277 | 7 |
| 金融理财 | 21 | 7512 | 百度钱包 | 104820340 | 8 |
| 金融理财 | 9 | 9 | 中国建设银行 | 77952913 | 9 |
| 金融理财 | 9 | 12 | 农行掌上银行 | 74952353 | 10 |
(2)、分开写结果:
Query: select * from (select label,ecid,eid,findedname,ppv,ROW_NUMBER() OVER(PARTITION BY label ORDER BY ppv desc) AS rn from( select ap.label,st.ecid,st.eid,ap.findedname,sum(pv) ppv from statistic_day_app_user st join app_label_class ap on st.ecid=ap.ecid and st.eid=ap.eid where find_in_set(ap.label,'旅游出行,网上购物,金融理财')>0 and day>=20180806 and day<=20180906 group by ap.label,st.ecid,st.eid,ap.findedname)stt)sttt where rn <=50
+----------+------+------+-----------------------------------------------+-------------+----+
| label | ecid | eid | findedname | ppv | rn |
+----------+------+------+-----------------------------------------------+-------------+----+
| 金融理财 | 9 | 3 | 支付宝 | 1274317627 | 1 |
| 金融理财 | 9 | 6 | 财付通 | 258357291 | 2 |
| 金融理财 | 21 | 8108 | 同花顺炒股票 | 242589336 | 3 |
| 金融理财 | 9 | 5 | 招商银行 | 203848183 | 4 |
| 金融理财 | 16 | 7128 | 平安普惠 | 178164508 | 5 |
| 金融理财 | 16 | 3 | 同花顺股票开户 | 142253306 | 6 |
| 金融理财 | 16 | 51 | 财富 | 131791277 | 7 |
| 金融理财 | 21 | 7512 | 百度钱包 | 104820340 | 8 |
| 金融理财 | 9 | 9 | 中国建设银行 | 77952913 | 9 |
| 金融理财 | 9 | 12 | 农行掌上银行 | 74952353 | 10 |
结论:两种写法均正确;