简化SQL式计算之多层固定分组

本文介绍了一种复杂的SQL计算——多层固定分组,并通过一个具体的库存状态计算实例展示了如何利用集算器进行高效处理。

多层固定分组也是我们经常面对的一种复杂SQL式计算。实现该算法的核心思路是用left join语句将源数据按照固定的依据对齐,但由于该算法往往涉及分组汇总、行间计算、填补缺失数据,而且层次较多,因此相应的SQL语句会很复杂。

集算器可以实现多层固定分组,代码简单易懂,下面用一个例子来说明。

   表stocklog存储着每天多种货物的多次出入库记录,要计算出指定时间段内每天每种货物的库存状态。表stocklog的部分数据如下:



 

其中Indicator的值如果为空,则表示该记录是入库动作,如果为ISSUE,则表示出库。需要注意的是,这里的日期也许有缺失,即某几天完全没有出入库记录,但库存状态必须包含完整的连续日期。

库存状态是每天每种产品的开库时数量(Open)、入库数量(Enter)、最高库存(Total)、出库数量(Issued)、闭库时数量(Close)。其中:当日的“Open”等于前一日的“Close”,“Enter”和“Issued”来自于表stocklog,“Total”等于“Open+Enter”,“Close”等于“Open+Enter-Issued”或者是“Total-Issued”。

   集算器代码如下:



 

       A1:查询数据库,根据表stocklog计算出每种产品每天总的入库数量和出库数量。这里只需要对数据进行分组汇总,计算上没有难度,可以交给SQL语句去完成。值得注意的是,A1中有两个参数startend,分别对应SQL语句中的两个问号,这代表外部传入的时间段,可以来自于JAVA或报表工具。假设startend的值分别为2014-04-01”和“2014-04-10”,则A1的计算结果如下:



 

A2=A1.group(Lname)

   这句代码将A按照Lname分组,每组数据是一个产品每天的总出入库数量。值得注意的是,这里无需对分组后的数据进行汇总计算。A2的计算结果如下图左侧,右侧是每组数据的明细。



 

关于分组,集算器有两个函数:groupsgroup。函数groups类似于SQL中的group by语句,可以在分组的同时进行汇总。而group函数只分组,不做汇总,这是SQL缺乏的功能。

 

最终的计算结果需要startend之间每一天的库存状态,而源数据并非每天都有出入库记录,因此要把A2按照连续的时间序列对齐。下面先生成这个时间序列。

B2=periods(start,end,1)

函数periods可以生成时间序列,有三个参数:起始时间、终止时间、间隔。缺省将生成日期序列,使用选项还可以生成年、季、月、旬的时间序列。A3的计算结果如下:



 

A3=for A2,这是循环语句,表示对A2进行循环,每次计算一个产品。

B3-B6是循环体,具体算法是先将该产品的出入库记录与B2中的时间序列对齐,然后计算每个产品每天的库存状态,最后将计算结果追加到B6中。值得注意的是,集算器使用直观的缩进来表示循环体,而不是括号或begin/end等标识符。B3-B6就是循环语句A3的循环体。

 

B3=A3.align(A3,Date)

这句代码将当前产品的出入库记录与B2中的时间序列对齐。注意,A3既是循环语句,也是循环变量,即当前产品的出入库记录。以产品item3为例,下图左侧是对齐前的数据,右侧是对齐后的数据:



 

B4>c=0

这句代码用来给变量c赋初值0c代表当前产品每条库存状态的Open字段,初始日期的Open字段为0c会在B5中不断被修改。

 

B5=B3.new(A3.Lname:Lname,B2(#):LDate, c:Opening, Enter,(b=c+Enter):Total,Issue,(c=b-Issue):Close)

这句代码用来计算库存状态。B3.new(…)表示以B3为基础新建一个序表,即当前产品的库存状态。新序表中有7个字段,如下:

A3.Lname:Lname----从当前产品的出入库记录A3取出Lname字段,新字段名为Lname

B2 (#):LDate ----将时间序列B2按顺序插入新序表,作为新字段LDate。注意,#表示A3的记录序号,B2(N)表示B2的第N条记录,因此B2(#)表示按A3的序号将B2插入新序表。

c:Open----将变量c作为Open的字段值,第一条记录时,这个值为0

Enter----B3中的字段Enter直接当做新字段。由于新序表是基于B3的,因此无需像Lname字段那样重命名。

(b=c+Enter):Total----按公式Open+Enter计算出字段Total,为了清晰起见,这里用括号把表达式括起来。

Issue---B3中的字段Issue直接当做新字段。

(c=b-Issue):Close---按公式Total-Issued计算Close字段。注意,这里的c已经被修改了,在计算下一条记录时,c会作为Open字段的值,从而满足业务规则:当日的“Open”等于前一日的“Close”。

item3为例,B5的计算结果如下:



 

B6=@|B5

 

这句代码将B5不断地追加到当前格中,@表示当前格B6,最终计算结果如下:



 

B6就是本案例的最终计算结果。

 

 

   另外,集算器可被报表工具或java程序调用,调用的方法也和普通数据库相似,使用它提供的JDBC接口即可向java主程序返回ResultSet形式的计算结果,具体方法可参考相关文档。

 

WITH email AS ( SELECT tenant_id, short_session_id, session_id, session_subject, product_id, strategy_id, customer_id, customer_email, staff_id, first_staff_id, session_status, offline_status, queue_status, session_type, email_group_id, first_email_group_id, CASE WHEN DATE(start_time) < '2025-05-22' THEN FROM_UNIXTIME(UNIX_TIMESTAMP(start_time) - 3600) ELSE start_time END AS start_time, CASE WHEN DATE(end_time) < '2025-05-22' THEN FROM_UNIXTIME(UNIX_TIMESTAMP(end_time) - 3600) ELSE end_time END AS end_time, CASE WHEN DATE(create_time) < '2025-05-22' THEN FROM_UNIXTIME(UNIX_TIMESTAMP(create_time) - 3600) ELSE create_time END AS create_time FROM lods.lods_smartxma_workorder_email_session WHERE customer_email NOT IN ( '1178627646@qq.com', 'advertise-noreply@support.facebook.com', '360541391@qq.com', '554041631@qq.com', 'ivan.gonzalezcruz@icloud.com', 'notification@facebookmail.com', 'adrianaguilera834@gmail.com', 'reminders@facebookmail.com', 'zjy703@qq.com', 'xueshi_oo@163.com', 'miaandrea2530@gmail.com', 'jiayannan0724@126.com', 'dani.coyolxauhqui63@gmail.com', 'miaandrea2530@gmail.com', 'miguel.cetis37fv@gmail.com', 'Miki.cekan63@gmail.com', 'miguel.hernandez21918@gmail.com', 'Ivan.gonzalezcruz@icloud.com', 'ivangzcruz@hotmail.com', 'Adrianaguilera834@gmai.com', 'Cobranzaprevia925@gmail.com', 'rtrinidadmaldonado@gmail.com','zjy703@qq.com', 'jiaxiong.wen@xiaolongcloud.com', '360541391@qq.com', 'suhuanzheng7784877@163.com' ) and customer_email not like '%qq.com' and customer_email not like '%163.com' and customer_email not like '%xiaolongcloud.com' and session_subject not like '%Prueba%' and ( session_subject NOT RLIKE '[\\u4e00-\\u9fa5]' or session_subject like '%外部邮件%') and session_subject not like '%test%' ), note ( SELECT DISTINCT session_id , field_val from lods.lods_smartxma_workorder_email_session_question_ext where field_val is not null and field_val!='' and field_name= 'Notas de servicio' ), response_time AS ( SELECT session_id, MIN(send_msg_time) AS f_send_msg_time FROM lods.lods_smartxma_workorder_email_session_message WHERE from_terminal = 'kf_web' AND msg_type = 1 GROUP BY session_id ), xiaojie AS ( SELECT * FROM lods.lods_smartxma_basic_basic_summary ), bf AS ( SELECT session_id, SPLIT(question, '-')[0] AS brief_sum1, SPLIT(question, '-')[1] AS brief_sum2, case when to_date(create_time)>='2025-07-24' then '' else SPLIT(question, '-')[2] end AS brief_sum3 FROM lods.lods_smartxma_workorder_email_session_question ),email_base AS ( SELECT *, LAG(create_time) OVER ( PARTITION BY customer_email ORDER BY create_time ) AS prev_time FROM email ), email_groups AS ( SELECT *, -- 邮箱内分组 (48小时连续) SUM(CASE WHEN prev_time IS NULL OR unix_timestamp(create_time) - unix_timestamp(prev_time) > 172800 THEN 1 ELSE 0 END) OVER ( PARTITION BY customer_email ORDER BY create_time ) AS group_id, -- 标记重复进线 CASE WHEN prev_time IS NOT NULL AND unix_timestamp(create_time) - unix_timestamp(prev_time) <= 172800 THEN '48h重复进线' ELSE NULL END AS repeat_mark FROM email_base ), group_info AS ( SELECT customer_email, group_id, -- 正确获取每个组的最早创建时间 MIN(create_time) AS group_start_time ,max(start_time) as group_max_time FROM email_groups GROUP BY customer_email, group_id ), -- 全局问题ID分配 global_groups AS ( SELECT customer_email, group_id, group_start_time,group_max_time, ROW_NUMBER() OVER (ORDER BY group_start_time) AS question_id FROM group_info ) INSERT OVERWRITE TABLE dm_opr.dm_opr_email_realtime_df PARTITION(dt) SELECT distinct eg.tenant_id, eg.short_session_id, eg.session_id, eg.session_subject, eg.product_id, eg.strategy_id, eg.customer_id, eg.customer_email, eg.staff_id, eg.first_staff_id, eg.session_status, eg.offline_status, eg.queue_status, eg.session_type, eg.email_group_id, eg.first_email_group_id, eg.start_time, eg.end_time, eg.create_time, eg.repeat_mark, gg.question_id, gg.group_start_time,gg.group_max_time, f.unique_id AS uniqueid, case when SPLIT(b.channel_name, '_')[0] ='Ala' then 'ala' else SPLIT(b.channel_name, '_')[0] end AS app_code, response_time.f_send_msg_time, bf.brief_sum1, bf.brief_sum2, bf.brief_sum3, note.field_val as notas , UNIX_TIMESTAMP(response_time.f_send_msg_time) - UNIX_TIMESTAMP(eg.start_time) AS avg_responstime, g.name AS staff_name, CASE WHEN w.unique_id IS NOT NULL THEN '凭证' ELSE '咨询' END AS type , case when oc.call_id is not null and oc.answer_time is not null then '已外呼客户接起' when oc.call_id is not null then '已外呼未接起' else '未外呼' end as type2, case when sw.session_id is not null then '有号码' else '无号码' end as type3, date(eg.create_time) as dt FROM email_groups eg JOIN global_groups gg ON eg.customer_email = gg.customer_email AND eg.group_id = gg.group_id LEFT JOIN response_time ON response_time.session_id = eg.session_id LEFT JOIN bf ON bf.session_id = eg.session_id LEFT JOIN ( SELECT channel_id, channel_name FROM lods.lods_smartxma_workorder_wo_email_channel ) b ON eg.product_id = b.channel_id LEFT JOIN ( SELECT * FROM lods.lods_smartxma_basic_basic_customer_identity ) k ON k.customer_id = eg.customer_id LEFT JOIN ( SELECT distinct unique_id, default.denc2md5(email) AS email FROM dwd.dwd_ip_cust_info_df WHERE dt = REPLACE(DATE_SUB(CURRENT_DATE(), 1), '-', '') ) f ON f.email = denc2md5(eg.customer_email) LEFT JOIN ( SELECT * FROM lods.lods_smartxma_basic_basic_user_global ) g ON cast(eg.staff_id as string)= cast(g.id as string) LEFT JOIN ( SELECT distinct unique_id, app_code, mobile, create_time, body_text FROM lods.lods_smartxma_workbench_whats_app_send_record where body_text LIKE '%.com%' ------- AND send_state = '1' ) w ON w.unique_id = f.unique_id AND cast(w.create_time as string) <= cast(eg.create_time as string) left join dm_opr.dm_opr_out_call_df oc on oc.uniqueid=f.unique_id and cast(gg.group_max_time as string) <= cast(oc.room_start_time as string) left join ( SELECT distinct session_id , msg_content FROM lods.lods_smartxma_workorder_email_session_message WHERE from_terminal ='user_email' and msg_type=1 and msg_content RLIKE '\\d{10,}' -- 匹配10位及以上数字 AND NOT msg_content RLIKE '(?i)zkj\\d{10}' ) sw on sw.session_id=eg.session_id left join note on note.session_id=eg.session_id; 以上是我的逻辑中间表逻辑,下面是我的计算指标sql,我想要计算时候取源表数据来计算,指标sql该怎么修改 select to_date(create_time) as date, '总计' as brief_sum2, app_code, count(distinct customer_email) as `邮件进线客户数` from dm_opr.dm_opr_email_realtime_df WHERE to_date(create_time) >=DATE_SUB(CURRENT_DATE(), 3) and app_code='pulsar' group by 1,2,3 order by 1 desc limit 1;
最新发布
11-06
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值