sql批处理的表-表及表字段-表字段之间的血缘分析的简单实现思路

本文链接：https://blog.youkuaiyun.com/zhzhair_2015/article/details/142530869

一、引言：通过批量sql脚本可以根据最终的目标表和对应的目标字段找到依赖表（来源表）和来源字段，也可以从基本的源表一层一层的追踪得到加工表的元数据。例如，我们要将每个班的总分最高的学生和班级信息数据放到优秀学生表里，sql如下：

insert into best_student (class_team, student_name, total_score, result_degree)
select A.c_team,B.s_name,A.m_score, 
    case when A.m_score = 100 then 'A' when A.m_score > 95 then 'B' else 'C' end as result_degree
from (select c_team, max(t_score) as m_score from score_report group by c_team) A
inner join score_report B on A.class_team = B.class_team and A.total_score = B.m_score;

上述sql语句的目标表-目标字段是

best_student-class_team, best_student-student_name, best_student-total_score, best_student-result_degree；

对应来源表-来源字段是

score_report-c_team, score_report-s_name,score_report-t_score,score_report-t_score;

二、下面我们给出如何用处理字符串的简单方式找出表及其字段的来源表及其来源字段：

1.因为sql语句的tab键和space键和换行键等可以随意增加，所以第一步是将文本内容格式化：将所有tab和换行以及加减乘除取余等字符串替换成“ ”，将引号包裹起来的字符串剔除掉（字符串里的特殊符号会影响字符串解析）。同时，为了insert后面字段能跟select后面逗号拆分开的文本一一对应，需替换掉函数括号内有“,”的逗号为“|”，比如substr(field, 1, 2)替换成substr(field|1|2)。

2.根据分号";"拆分成多个片段，并将sql中的子查询提出来作为临时中间表（递归的方式截取出来所有子查询，直到没有子查询存在为止），比如引言部分的sql就可以写成如下形式：

insert into best_student (class_team, student_name, total_score, result_degree)
select A.c_team,B.s_name,A.m_score, 
    case when A.m_score = 100 then 'A' when A.m_score > 95 then 'B' else 'C' end as 
    result_degree
from score_report_manual_son_1 A
inner join score_report B on A.class_team = B.class_team and A.total_score = B.m_score;

drop table if exists score_report_manual_son_1;
create table score_report_manual_son_1
(
    c_team, 
    m_score
);
insert into score_report_manual_son_1
(
    c_team, 
    m_score
)
select c_team, max(t_score) as m_score from score_report group by c_team;

3.从目标表插入数据或目标表更新数据的sql语句开始找出目标表字段对应的来源表及来源字段，当然也可能是临时中间表及其虚拟字段。如果是临时表，则继续用递归的方式追踪到来源表和来源字段。其中，查找过程中需要找出表别名和包含的字段关系（表字段可以从建表语句或者数据库查询出来），具体做法见以下第4部分。

4.查找来源表和来源字段：如引言的案例，目标表第四个字段是result_degree，对应select部分的sql字符串是case when A.m_score = 100 then 'A' when A.m_score > 95 then 'B' else 'C' end as result_degree。可以看出来，临时来源表和虚拟来源字段是score_report_manual_son_1和m_score，再追踪就到了来源表score_report和来源字段t_score，解析步骤如下：

i) 定义包含方法contains(text, word, seperatorArray)，假设seperatorArray=['|', ' ', '(', ')', '>', '<', '=', '.']。如果word两边拼接上seperatorArray的一个字符是text的子串或者text以word加seperatorArray的一个字符开始或者text以seperatorArray的一个字符加word结束，该方法就返回true，否则返回false。

ii) best_student-result_degree对应score_report_manual_son_1-(case when A.m_score = 100 then 'A' when A.m_score > 95 then 'B' else 'C' end as result_degree)，根据别名找出A就是score_report_manual_son_1，并且score_report_manual_son_1的字段m_score满足

contains(case when A.m_score = 100 then 'A' when A.m_score > 95 then 'B' else 'C' end, m_score, seperatorArray) = true，于是就找到了临时来源表和虚拟来源字段是score_report_manual_son_1和m_score。接着继续递归往下找，就找出来源表和来源字段分别是score_report和t_score。