以下是同事人员发过来的sql,让我帮他优化一下,clickhouse的语句,需要替换成在自己的数据库上执行,dth_diagnosis_all 这个表的数据量90217505 ,千万级的数据量,说是内存不足,sql执行不了。
select sum(case when re.diagnosis_code = 'E10' then 1 else 0 end) as E10Count,
sum(case when re.diagnosis_code = 'E11' then 1 else 0 end) as E11Count,
sum(case when re.diagnosis_code != 'E10' and re.diagnosis_code != 'E11' then 1 else 0 end) as otherCount
from (select t1.id_no as id_no, left(t1.diagnosis_code,3) as diagnosis_code
from (select t2.id_no, t2.org_code, t2.event_type_code, t2.org_treatment_no,
max(t2.register_date) as register_date,t2.diagnosis_code
from dth.dth_diagnosis_all t2
where t2.register_date <= #{initYear}
and left(t2.diagnosis_code,3) in('E10','E11','E12','E13','E14')
group by t2.id_no,t2.org_code,t2.event_type_code,t2.org_treatment_no,t2.diagnosis_code)t3,
dth.dth_diagnosis_all t1
where t1.id_no = t3.id_no
and t1.org_code = t3.org_code and t1.diagnosis_code = t3.diagnosis_code
and t1.event_type_code = t3.event_type_code
and t1.org_treatment_no = t3.org_treatment_no
and t1.register_date = t3.register_date
and t1.org_province = #{province}
and t1.org_city = #{city}
)re left join (select * from dth_ehr_pir_all
where birth_date <= #{initYear}
and register_province = #{province}
and register_city = #{city}
and diabetes_flag != '0'
and (death_date = ''
or death_date > #{initYear})) t
on t.id_no = re.id_no
where t.birth_date <= #{initYear}
and t.register_province = #{province}
and t.register_city = #{city}
and t.diabetes_flag != '0'
and (t.death_date = '' or t.death_date > #{initYear})
参数:initYear(查询年份12月31,如2029-12-31), #{province},查询省份:(山东省), #{city} :查询城市
两个子查询的数据量
以下为优化后的sql,看这个sql子查询t1和t3,其实是同一张表,然后对同一张表做连接查询,对t3表做了where条件过滤(这个不用改),之后连接查询之后又对t1表进行了where条件过滤,如下:
t1.org_province = #{province}
and t1.org_city = #{city}
为啥不提前提到t1子查询里面去呢,
再看一下join条件中
and t1.diagnosis_code = t3.diagnosis_code
and t1.register_date = t3.register_date,
说明t3和t1连接条件是一样的,又是同一张表,t3中也对这两个列做了where限制,那么子查询t1中也可以加入这两个条件,于是把一下四个条件提到t1的子查询中去,大大降低了临时表占用的内存空间,
t1.org_province = #{province}
and t1.org_city = #{city}
and register_date <= #{initYear}
and left(diagnosis_code,3) in('E10','E11','E12','E13','E14')
(
select *
from dth.dth_diagnosis_all
where
org_province = #{province} and org_city = #{city}
and register_date <= #{initYear}
and left(diagnosis_code,3) in('E10','E11','E12','E13','E14')
) t1
然后就是针对 t 表做优化,子查询中已经对 t 表做了where条件过滤,连接查询后面又对 t 表进行单独的过滤,这个完全没必要再加同样的条件了。
之后再看最外层的
sum(case when re.diagnosis_code = 'E10' then 1 else 0 end) as E10Count,
sum(case when re.diagnosis_code = 'E11' then 1 else 0 end) as E11Count,
sum(case when re.diagnosis_code != 'E10' and re.diagnosis_code != 'E11' then 1 else 0 end) as otherCount
这个一看就是对条件进行统计个数,可以改成效率更高的count
改下如下:
count(case when re.diagnosis_code = 'E10' then '1' else null end) as E10Count,
count(case when re.diagnosis_code = 'E11' then '1' else null end) as E11Count,
count(case when re.diagnosis_code != 'E10' and re.diagnosis_code != 'E11' then '1' else null end) as otherCount
至此改完之后的结果
select
count(case when re.diagnosis_code = 'E10' then '1' else null end) as E10Count,
count(case when re.diagnosis_code = 'E11' then '1' else null end) as E11Count,
count(case when re.diagnosis_code != 'E10' and re.diagnosis_code != 'E11' then '1' else null end) as otherCount
from
(
select
t1.id_no as id_no,
left(t1.diagnosis_code,3) as diagnosis_code
from
(
select
t2.id_no, t2.org_code, t2.event_type_code, t2.org_treatment_no,
max(t2.register_date) as register_date,
t2.diagnosis_code
from dth.dth_diagnosis_all t2
where
t2.register_date <= #{initYear}
and left(t2.diagnosis_code,3) in('E10','E11','E12','E13','E14')
group by
t2.id_no,t2.org_code,t2.event_type_code,t2.org_treatment_no,t2.diagnosis_code
) t3
inner join
(
select *
from dth.dth_diagnosis_all
where
org_province = #{province} and org_city = #{city}
and register_date <= #{initYear}
and left(diagnosis_code,3) in('E10','E11','E12','E13','E14')
) t1
on
t1.id_no = t3.id_no
and t1.org_code = t3.org_code
and t1.diagnosis_code = t3.diagnosis_code
and t1.event_type_code = t3.event_type_code
and t1.org_treatment_no = t3.org_treatment_no
and t1.register_date = t3.register_date
) re
left join
(
select *
from dth_ehr_pir_all
where
birth_date <= #{initYear}
and register_province = #{province}
and register_city = #{city}
and diabetes_flag != '0'
and (death_date = '' or death_date > #{initYear})
) t
on t.id_no = re.id_no;
sql能跑出来了,不会报出内存不足了