离线数仓DWS层
数据服务层:DWS(Data WareHouse Servce)公共汇总层
存储公用的中间结果,减少重复计算,提高中间结果的复用性。依托于ADS层的需求,商榷中间公共结果(需求驱动)
设计要点:
(1)DWS层的设计参考链接: 数仓指标体系
(2)DWS层的数据存储格式为ORC列式存储 + snappy压缩。
(3)DWS层表名的命名规范为dws_数据域_统计粒度_业务过程_统计周期(1d/nd/td)
DWS 层为公共汇总层,会进行轻度汇总,粒度比明细数据稍粗,基于 DWD 层上的基础数据,整合汇总成分析某一个主题域的服务数据,一般是宽表。DWS 层应覆盖 80% 的应用场景。又称数据集市或宽表。
按照业务划分,如主题域流量、订单、用户等,生成字段比较多的宽表,用于提供后续的业务查询,OLAP 分析,数据分发等。
一般来讲,该层的数据表会相对比较少,一张表会涵盖比较多的业务内容,由于其字段较多,因此一般也会称该层的表为宽表。
DWS公共汇总层设计规范
数据仓库的性能是数据仓库建设是否成功的重要标准之一。聚集主要是通过汇总明细粒度数据来获得改进查询性能的效果。通过访问聚集数据,可以减少数据库在响应查询时必须执行的工作量,能够快速响应用户的查询,同时有利于减少不同用访问明细数据带来的结果不一致问题。(聚集的作用)
-
聚集的基本原则
● 一致性。聚集表必须提供与查询明细粒度数据一致的查询结果。
● 避免单一表设计。不要在同一个表中存储不同层次的聚集数据。
● 聚集粒度可不同。聚集并不需要保持与原始明细粒度数据一样的粒度,聚集只关心所需要查询的维度。 -
聚集的基本步骤
第一步:确定聚集维度
在原始明细模型中会存在多个描述事实的维度,如日期、商品类别、卖家等,这时候需要确定根据什么维度聚集,如果只关心商品的交易额情况,那么就可以根据商品维度聚集数据。
第二步:确定一致性上钻
这时候要关心是按月汇总还是按天汇总,是按照商品汇总还是按照类目汇总,如果按照类目汇总,还需要关心是按照大类汇总还是小类汇总。当然,我们要做的只是了解用户需要什么,然后按照他们想要的进行聚集。
第三步:确定聚集事实
在原始明细模型中可能会有多个事实的度量,比如在交易中有交易额、交易数量等,这时候要明确是按照交易额汇总还是按照成交数量汇总。 -
公共汇总层设计原则
除了聚集基本的原则外,公共汇总层还必须遵循以下原则:
● 数据公用性。汇总的聚集会有第三者使用吗?基于某个维度的聚集是不是经常用于数据分析中?如果答案是肯定的,那么就有必要把明细数据经过汇总沉淀到聚集表中。
● 不跨数据域。数据域是在较高层次上对数据进行分类聚集的抽象。如以业务
● 区分统计周期。在表的命名上要能说明数据的统计周期,如 _Id
表示最近1天,_td 表示截至当天,_nd 表示最近N天。
离线数仓DWS层设计要点:
(1)DWS层的设计参考指标体系。
(2)DWS层的数据存储格式为ORC列式存储 + snappy压缩。
(3)DWS层表名的命名规范为dws_数据域_统计粒度_业务过程_统计周期(1d/nd/td)
注:1d表示最近1日,nd表示最近n日,td表示历史至今。
10.1 最近1日汇总表
10.1.1 交易域用户商品粒度订单最近1日汇总表
1)建表语句
DROP TABLE IF EXISTS dws_trade_user_sku_order_1d;
CREATE EXTERNAL TABLE dws_trade_user_sku_order_1d
(
`user_id` STRING COMMENT '用户id',
`sku_id` STRING COMMENT 'sku_id',
`sku_name` STRING COMMENT 'sku名称',
`category1_id` STRING COMMENT '一级分类id',
`category1_name` STRING COMMENT '一级分类名称',
`category2_id` STRING COMMENT '一级分类id',
`category2_name` STRING COMMENT '一级分类名称',
`category3_id` STRING COMMENT '一级分类id',
`category3_name` STRING COMMENT '一级分类名称',
`tm_id` STRING COMMENT '品牌id',
`tm_name` STRING COMMENT '品牌名称',
`order_count_1d` BIGINT COMMENT '最近1日下单次数',
`order_num_1d` BIGINT COMMENT '最近1日下单件数',
`order_original_amount_1d` DECIMAL(16, 2) COMMENT '最近1日下单原始金额',
`activity_reduce_amount_1d` DECIMAL(16, 2) COMMENT '最近1日活动优惠金额',
`coupon_reduce_amount_1d` DECIMAL(16, 2) COMMENT '最近1日优惠券优惠金额',
`order_total_amount_1d` DECIMAL(16, 2) COMMENT '最近1日下单最终金额'
) COMMENT '交易域用户商品粒度订单最近1日汇总事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dws/dws_trade_user_sku_order_1d'
TBLPROPERTIES ('orc.compress' = 'snappy');
2)数据装载
(1)首日装载
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_trade_user_sku_order_1d partition(dt)
select
user_id,
id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name,
order_count_1d,
order_num_1d,
order_original_amount_1d,
activity_reduce_amount_1d,
coupon_reduce_amount_1d,
order_total_amount_1d,
dt
from
(
select
dt,
user_id,
sku_id,
count(*) order_count_1d,
sum(sku_num) order_num_1d,
sum(split_original_amount) order_original_amount_1d,
sum(nvl(split_activity_amount,0.0)) activity_reduce_amount_1d,
sum(nvl(split_coupon_amount,0.0)) coupon_reduce_amount_1d,
sum(split_total_amount) order_total_amount_1d
from dwd_trade_order_detail_inc
group by dt,user_id,sku_id
)od
left join
(
select
id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name
from dim_sku_full
where dt='2020-06-14'
)sku
on od.sku_id=sku.id;
(2)每日装载
insert overwrite table dws_trade_user_sku_order_1d partition(dt='2020-06-15')
select
user_id,
id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name,
order_count,
order_num,
order_original_amount,
activity_reduce_amount,
coupon_reduce_amount,
order_total_amount
from
(
select
user_id,
sku_id,
count(*) order_count,
sum(sku_num) order_num,
sum(split_original_amount) order_original_amount,
sum(nvl(split_activity_amount,0)) activity_reduce_amount,
sum(nvl(split_coupon_amount,0)) coupon_reduce_amount,
sum(split_total_amount) order_total_amount
from dwd_trade_order_detail_inc
where dt='2020-06-15'
group by user_id,sku_id
)od
left join
(
select
id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name
from dim_sku_full
where dt='2020-06-15'
)sku
on od.sku_id=sku.id;
## 10.1.2 交易域用户商品粒度退单最近1日汇总表
1)建表语句
DROP TABLE IF EXISTS dws_trade_user_sku_order_refund_1d;
CREATE EXTERNAL TABLE dws_trade_user_sku_order_refund_1d
(
`user_id` STRING COMMENT '用户id',
`sku_id` STRING COMMENT 'sku_id',
`sku_name` STRING COMMENT 'sku名称',
`category1_id` STRING COMMENT '一级分类id',
`category1_name` STRING COMMENT '一级分类名称',
`category2_id` STRING COMMENT '一级分类id',
`category2_name` STRING COMMENT '一级分类名称',
`category3_id` STRING COMMENT '一级分类id',
`category3_name` STRING COMMENT '一级分类名称',
`tm_id` STRING COMMENT '品牌id',
`tm_name` STRING COMMENT '品牌名称',
`order_refund_count_1d` BIGINT COMMENT '最近1日退单次数',
`order_refund_num_1d` BIGINT COMMENT '最近1日退单件数',
`order_refund_amount_1d` DECIMAL(16, 2) COMMENT '最近1日退单金额'
) COMMENT '交易域用户商品粒度退单最近1日汇总事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dws/dws_trade_user_sku_order_refund_1d'
TBLPROPERTIES ('orc.compress' = 'snappy');
2)数据装载
(1)首日装载
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_trade_user_sku_order_refund_1d partition(dt)
select
user_id,
sku_id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name,
order_refund_count,
order_refund_num,
order_refund_amount,
dt
from
(
select
dt,
user_id,
sku_id,
count(*) order_refund_count,
sum(refund_num) order_refund_num,
sum(refund_amount) order_refund_amount
from dwd_trade_order_refund_inc
group by dt,user_id,sku_id
)od
left join
(
select
id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name
from dim_sku_full
where dt='2020-06-14'
)sku
on od.sku_id=sku.id;
(2)每日装载
insert overwrite table dws_trade_user_sku_order_refund_1d partition(dt='2020-06-15')
select
user_id,
sku_id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name,
order_refund_count,
order_refund_num,
order_refund_amount
from
(
select
user_id,
sku_id,
count(*) order_refund_count,
sum(refund_num) order_refund_num,
sum(refund_amount) order_refund_amount
from dwd_trade_order_refund_inc
where dt='2020-06-15'
group by user_id,sku_id
)od
left join
(
select
id,
sku_name,
category1_id,
category1_name,
category2_id,
category2_name,
category3_id,
category3_name,
tm_id,
tm_name
from dim_sku_full
where dt='2020-06-15'
)sku
on od.sku_id=sku.id;
10.1.3 交易域用户粒度订单最近1日汇总表
1)建表语句
DROP TABLE IF EXISTS dws_trade_user_order_1d;
CREATE EXTERNAL TABLE dws_trade_user_order_1d
(
`user_id` STRING COMMENT '用户id',
`order_count_1d` BIGINT COMMENT '最近1日下单次数',
`order_num_1d` BIGINT COMMENT '最近1日下单商品件数',
`order_original_amount_1d` DECIMAL(16, 2) COMMENT '最近1日最近1日下单原始金额',
`activity_reduce_amount_1d` DECIMAL(16, 2) COMMENT '最近1日下单活动优惠金额',
`coupon_reduce_amount_1d` DECIMAL(16, 2) COMMENT '下单优惠券优惠金额',
`order_total_amount_1d` DECIMAL(16, 2) COMMENT '最近1日下单最终金额'
) COMMENT '交易域用户粒度订单最近1日汇总事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dws/dws_trade_user_order_1d'
TBLPROPERTIES ('orc.compress' = 'snappy');
2)数据装载
(1)首日装载
insert overwrite table dws_trade_user_order_1d partition(dt)
select
user_id,
count(distinct(order_id)),
sum(sku_num),
sum(split_original_amount),
sum(nvl(split_activity_amount,0)),
sum(nvl(split_coupon_amount,0)),
sum(split_total_amount),
dt
from dwd_trade_order_detail_inc
group by user_id,dt;
(2)每日装载
insert overwrite table dws_trade_user_order_1d partition(dt='2020-06-15')
select
user_id,
count(distinct(order_id)),
sum(sku_num),
sum(split_original_amount),
sum(nvl(split_activity_amount,0)),
sum(nvl(split_coupon_amount,0)),
sum(split_total_amount)
from dwd_trade_order_detail_inc
where dt='2020-06-15'
group by user_id;
10.1.4 交易域用户粒度加购最近1日汇总表
1)建表语句
DROP TABLE IF EXISTS dws_trade_user_cart_add_1d;
CREATE EXTERNAL TABLE dws_trade_user_cart_add_1d
(
`user_id` STRING COMMENT '用户id',
`cart_add_count_1d` BIGINT COMMENT '最近1日加购次数',
`cart_add_num_1d` BIGINT COMMENT '最近1日加购商品件数'
) COMMENT '交易域用户粒度加购最近1日汇总事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dws/dws_trade_user_cart_add_1d'
TBLPROPERTIES ('orc.compress' = 'snappy');
2)数据装载
(1)首日装载
insert overwrite table dws_trade_user_cart_add_1d partition(dt)
select
user_id,
count(*),
sum(sku_num),
dt
from dwd_trade_cart_add_inc
group by user_id,dt;
(2)每日装载
insert overwrite table dws_trade_user_cart_add_1d partition(dt='2020-06-15')
select
user_id,
count(*),
sum(sku_num)
from dwd_trade_cart_add_inc
where dt='2020-06-15'
group by user_id;
10.1.5 交易域用户粒度支付最近1日汇总表
1)建表语句
DROP TABLE IF EXISTS dws_trade_user_payment_1d;
CREATE EXTERNAL TABLE dws_trade_user_payment_1d
(
`user_id` STRING COMMENT '用户id',
`payment_count_1d` BIGINT COMMENT '最近1日支付次数',
`payment_num_1d` BIGINT COMMENT '最近1日支付商品件数',
`payment_amount_1d` DECIMAL(16, 2) COMMENT '最近1日支付金额'
) COMMENT '交易域用户粒度支付最近1日汇总事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dws/dws_trade_user_payment_1d'
TBLPROPERTIES ('orc.compress' = 'snappy');
2)数据装载
(1)首日装载
insert overwrite table dws_trade_user_payment_1d partition(dt)
select
user_id,
count(distinct(order_id)),
sum(sku_num),
sum(split_payment_amount),
dt
from dwd_trade_pay_detail_suc_inc
group by user_id,dt;
(2)每日装载
insert overwrite table dws_trade_user_payment_1d partition(dt='2020-06-15')
select
user_id,
count(distinct(order_id)),
sum(sku_num),
sum(split_payment_amount)
from dwd_trade_pay_detail_suc_inc
where dt='2020-06-15'
group by user_id;
10.1.6 交易域省份粒度订单最近1日汇总表
1)建表语句
DROP TABLE IF EXISTS dws_trade_province_order_1d;
CREATE EXTERNAL TABLE dws_trade_province_order_1d
(
`province_id` STRING COMMENT '用户id',
`province_name` STRING COMMENT '省份名称',
`area_code` STRING COMMENT '地区编码',
`iso_code` STRING COMMENT '旧版ISO-3166-2编码',
`iso_3166_2` STRING COMMENT '新版版ISO-3166-2编码',
`order_count_1d` BIGINT COMMENT '最近1日下单次数',
`order_original_amount_1d` DECIMAL(16, 2) COMMENT '最近1日下单原始金额',
`activity_reduce_amount_1d` DECIMAL(16, 2) COMMENT '最近1日下单活动优惠金额',
`coupon_reduce_amount_1d` DECIMAL(16, 2) COMMENT '最近1日下单优惠券优惠金额',
`order_total_amount_1d` DECIMAL(16, 2) COMMENT '最近1日下单最终金额'
) COMMENT '交易域省份粒度订单最近1日汇总事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dws/dws_trade_province_order_1d'
TBLPROPERTIES ('orc.compress' = 'snappy');
2)数据装载
(1)首日装载
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_trade_province_order_1d partition(dt)
select
province_id,
province_name,
area_code,
iso_code,
iso_3166_2,
order_count_1d,
order_original_amount_1d,
activity_reduce_amount_1d,
coupon_reduce_amount_1d,
order_total_amount_1d,
dt
from
(
select
province_id,
count(distinct(order_id)) order_count_1d,
sum(split_original_amount) order_original_amount_1d,
sum(nvl(split_activity_amount,0)) activity_reduce_amount_1d,
sum(nvl(split_coupon_amount,0)) coupon_reduce_amount_1d,
sum(split_total_amount) order_total_amount_1d,
dt
from dwd_trade_order_detail_inc
group by province_id,dt
)o
left join
(
select
id,
province_name,
area_code,
iso_code,
iso_3166_2
from dim_province_full
where dt='2020-06-14'
)p
on o.province_id=p.id;
(2)每日装载
insert overwrite table dws_trade_province_order_1d partition(dt='2020-06-15')
select
province_id,
province_name,
area_code,
iso_code,
iso_3166_2,
order_count_1d,
order_original_amount_1d,
activity_reduce_amount_1d,
coupon_reduce_amount_1d,
order_total_amount_1d
from
(
select
province_id,
count(distinct(order_id)) order_count_1d,
sum(split_original_amount) order_original_amount_1d,
sum(nvl(split_activity_amount,0)) activity_reduce_amount_1d,
sum(nvl(split_coupon_amount,0)) coupon_reduce_amount_1d,
sum(split_total_amount) order_total_amount_1d
from dwd_trade_order_detail_inc
where dt='2020-06-15'
group by province_id
)o
left join
(
select
id,
province_name,
area_code,
iso_code,
iso_3166_2
from dim_province_full
where dt='2020-06-15'
)p
on o.province_id=p.id;
10.1.7 交易域用户粒度退单最近1日汇总表
1)建表语句
DROP TABLE IF EXISTS dws_trade_user_order_refund_1d;
CREATE EXTERNAL TABLE dws_trade_user_order_refund_1d
(
`user_id` STRING COMMENT '用户id',
`order_refund_count_1d` BIGINT COMMENT '最近1日退单次数',
`order_refund_num_1d` BIGINT COMMENT '最近1日退单商品件数',
`order_refund_amount_1d` DECIMAL(16, 2) COMMENT '最近1日退单金额'
) COMMENT '交易域用户粒度退单最近1日汇总事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dws/dws_trade_user_order_refund_1d'
TBLPROPERTIES ('orc.compress' = 'snappy');
2)数据装载
(1)首日装载
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_trade_user_order_refund_1d partition(dt)
select
user_id,
count(*) order_refund_count,
sum(refund_num) order_refund_num,
sum(refund_amount) order_refund_amount,
dt
from dwd_trade_order_refund_inc
group by user_id,dt;
(2)每日装载
insert overwrite table dws_trade_user_order_refund_1d partition(dt='2020-06-15')
select
user_id,
count(*),
sum(refund_num),
sum(refund_amount)
from dwd_trade_order_refund_inc
where dt='2020-06-15'
group by user_id;
10.1.8 流量域会话粒度页面浏览最近1日汇总表
1)建表语句
DROP TABLE IF EXISTS dws_traffic_session_page_view_1d;
CREATE EXTERNAL