目录
本篇说明分段维度的实现技术。分段维度包含连续值的分段,例如,年度销售订单分段维度可能包含有叫做“低”、“中”、“高”的三档,各档定义分别为 0.01 到 15000、15000.01 到 30000.00、30000.01 到 99999999.99。如果一个客户的年度销售订单金额为 10000,则被归为“低”档。
分段维度可以存储多个分段集合,例如,可能有一个用于促销分析的分段集合,另一个用于市场细分,可能还有一个用于销售区域计划。分段一般由用户定义,而且很少能从交易源数据直接获得。
一、年度销售订单星型模式
本节说明如何实现一个年度订单分段维度,需要两个新的星型模式,如下图所示。星型模式的事实表使用(关联到)已有的 customer_dim 和一个新的 year_dim 表,年维度是日期维度的子集。annual_customer_segment_fact 是唯一用到 annual_order_segment_dim 表的表。annual_order_segement_dim 是分段维度表。
annual_order_segment_dim 表存储多个分段集合,在下面的例子里将两个分段集合“project alpha”和“grid”导入 annual_order_segment_dim 表。这两种分段集合都是按照用户的年度销售订单金额将其分类,project alpha分六段,grid分三段。下表显示了这个分段的例子。
Segment Name | Band Name | Start Value | End Value |
PROJECT ALPHA | Bottom | 0.01 | 2500.00 |
PROJECT ALPHA | Low | 2500.01 | 3000.00 |
PROJECT ALPHA | Mid-low | 3000.01 | 4000.00 |
PROJECT ALPHA | Mid | 4000.01 | 5500.00 |
PROJECT ALPHA | Mid-high | 5500.01 | 6500.00 |
PROJECT ALPHA | Top | 6500.01 | 99999999.99 |
Grid | LOW | 0.01 | 3000.00 |
Grid | MED | 3000.01 | 6000.00 |
Grid | HIGH | 6000.01 | 99999999.99 |
每一分段有一个开始值和一个结束值。分段的粒度就是本段和下段之间的间隙,粒度必须是度量的最小可能值,在销售订单金额的示例中是 0.01。最后一个分段的结束值是销售订单金额可能的最大值。下面的脚本用于建立分段维度数据仓库模式。
use dw;
create table annual_order_segment_dim (
segment_sk int,
segment_name varchar(30),
band_name varchar(50),
band_start_amount decimal(10,2),
band_end_amount decimal(10,2),
version int,
effective_date date,
expiry_date date
)
clustered by (segment_sk) into 8 buckets
stored as orc tblproperties ('transactional'='true');
insert into annual_order_segment_dim values (1, 'project alpha', 'bottom', 0.01, 2500.00, 1, '1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values (2, 'project alpha', 'low', 2500.01, 3000.00, 1, '1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values (3, 'project alpha', 'mid-low', 3000.01, 4000.00, 1, '1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values (4, 'project alpha', 'mid', 4000.01, 5500.00, 1, '1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values (5, 'project alpha', 'mid_high', 5500.01, 6500.00, 1, '1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values (6, 'project alpha', 'top', 6500.01, 99999999.99, 1, ' 1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values (7, 'grid', 'low', 0.01, 3000, 1, '1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values (8, 'grid', 'med', 3000.01, 6000.00, 1, ' 1900-01-01', '2200-01-01');
insert into annual_order_segment_dim values (9, 'grid', 'high', 6000.01, 99999999.99, 1, '1900-01-01', '2200-01-01');
create table year_dim (
year_sk int,
year int
);
create table annual_sales_order_fact (
customer_sk int,
year_sk int,
annual_order_amount decimal(10, 2)
);
create table annual_customer_segment_fact (
segment_sk int,
customer_sk int,
year_sk int
);
二、初始装载
本节说明初始装载并进行测试。下面的初始装载脚本将 order_date 维度表(date_dim 表的一个视图)里的数据导入 year_dim 表,将 sales_order_fact 表里的数据导入 annual_sales_order_fact 表,将 annual_sales_order_fact 表里的数据导入 annual_customer_segment_fact 表。此脚本装载所有历史数据。
use dw;
insert into year_dim
select row_number() over (order by t1.year) + t2.sk_max, year
from (select distinct year year from order_date_dim) t1
cross join (select coalesce(max(year_sk),0) sk_max from year_dim) t2;
insert into annual_sales_order_fact
select a.customer_sk,
year_sk,
sum(order_amount)
from sales_order_fact a,
year_dim c,
order_date_dim d
where a.order_date_sk = d.order_date_sk
and c.year = d.year
and d.year < 2017
group by a.customer_sk, c.year_sk;
insert into annual_customer_segment_fact
select d.segment_sk,
a.customer_sk,
a.year_sk
from annual_sales_order_fact a,
annual_order_segment_dim d
where annual_order_amount >= band_start_amount
and annual_order_amount <= band_end_amount;
执行初始装载脚本,查询 annual_customer_segment_fact 表确认初始装载是成功的。
select a.customer_sk csk,
a.year_sk ysk,
annual_order_amount amt,
segment_name sn,
band_name bn
from annual_customer_segment_fact a,
annual_order_segment_dim b,
year_dim c,
annual_sales_order_fact d
where a.segment_sk = b.segment_sk
and a.year_sk = c.year_sk
and a.customer_sk = d.customer_sk
and a.year_sk = d.year_sk
cluster by csk, ysk, sn, bn;
查询结果如下图所示。
注意,这里是按客户代理键 customer_sk 分组求和来判断分段,实际情况可能是以 customer_number 进行分组的,因为无论客户的 scd 属性如何变化,一般还是认为是一个客户
三、定期装载
本节说明定期装载脚本和如何测试它。除了无需装载 year_dim 表以外,定期装载与初始装载类似。annual_sales_order_fact 表里的数据被导入 annual_customer_segment_fact 表。每年调度执行下面的定期装载脚本,此脚本装载前一年的销售数据。
use dw;
insert into annual_sales_order_fact
select a.customer_sk,
year_sk,
sum(order_amount)
from sales_order_fact a,
year_dim c,
order_date_dim d
where a.order_date_sk = d.order_date_sk
and c.year = d.year
and d.year = year(current_date) - 1
group by a.customer_sk, c.year_sk;
insert into annual_customer_segment_fact
select d.segment_sk,
a.customer_sk,
c.year_sk
from annual_sales_order_fact a,
year_dim c,
annual_order_segment_dim d
where a.year_sk = c.year_sk
and c.year = year(current_date) - 1
and annual_order_amount >= band_start_amount
and annual_order_amount <= band_end_amount;