【数仓宽表设计】我对宽表设计的思考

一、背景

在数仓开发过程中,当业务过程中存在相似性时,我们会采用宽表设计,这样可以提升数据整合性、提高数据查询效率,并降低维护成本。同时,宽表设计也便于跨系统和跨部门的数据整合与调用。但是宽表的设计从维度建模的思想来说适当破坏了规范,这一层要根据具体场景去平衡。

二、不同场景下的宽表思考

1.电商业务/金融业务

在电商场景,我们进行维度建模时首先要选择业务过程,比如下单业务,涉及订单事实表,维度包括时间/地区/优惠券/商品等,度量值有运费/优惠金额/原始金额/最终金额等,在这个特定的场景事实表单一,我们按照维度建模正常构建就可以,我认为没有宽表的设计必要;但是如果在金融业务,不仅有订单事实表,还会有借款审批事实表、还款计划事实表等,那我们能想到这里的业务过程是存在相关性的,在订单交易过程中,我们会将借款申请和借款审批信息纳入还款计划,同时将贷后数据如还款状态和还款时间也包括在内。因此这里就有必要整合宽表。

2.社区业务

在社区场景,我们进行维度建模时首先要选择业务过程,比如评论业务,涉及评论表事实表、帖子表事实表、回复事实表,这个过程我认为可以构建宽表,以评论表为主表,关联帖子表,然后关联评论回复表。显然,用户可以评论帖子也可以回复评论。下面是该场景的SQL示例:

select a.global_comment_id                                               as global_comment_id,
       a.business_id                                                     as business_id,
       a.ctime                                                           as ctime,
       a.mtime                                                           as mtime,
       a.is_deleted                                                      as is_deleted,
       a.relation_business_id                                            as relation_business_id,
       a.relation_business_type                                          as relation_business_type,
       a.target_user_account                                             as target_user_account,
       a.target_work_code                                                as target_work_code,
       a.source_user_account                                             as source_user_account,
       xxx as uid_leader,
       a.source_work_code                                                as source_work_code,
       a.parent_business_id                                              as parent_business_id,
       a.content                                                         as content,
       a.comment_status                                                  as comment_status,
       a.comment_time                                                    as comment_time,
       a.is_publisher                                                    as is_publisher,
       a.is_read                                                         as is_read,
       a.floor_num                                                       as floor_num,
       a.audit_status                                                    as audit_status,
       a.comment_type                                                    as comment_type,
       a.like_count                                                      as like_count,
       a.target_nickname                                                 as target_nickname,
       a.source_nickname                                                 as source_nickname,
       a.is_top                                                          as is_top,
       a.article_business_id                                             as article_business_id,
       a.top_time                                                        as top_time,
       b.global_article_id                                               as comment_article_id,
       b.business_id                                                     as comment_article_business_id,
       case when b.business_id = d.pitcure_business_id then 1 else 0 end as comment_article_is_pitcure_business_id,
       b.ctime                                                           as comment_article_ctime,
       b.mtime                                                           as comment_article_mtime,
       b.is_deleted                                                      as comment_article_is_deleted,
       b.article_status                                                  as comment_article_article_status,
       b.simple_content                                                  as comment_article_simple_content,
       b.title                                                           as comment_article_title,
       b.publish_time                                                    as comment_article_publish_time,
       b.latest_reply_time                                               as comment_article_latest_reply_time,
       b.user_account                                                    as comment_article_user_account,
       b.work_code                                                       as comment_article_work_code,
       b.lock_description                                                as comment_article_lock_description,
       b.audit_status                                                    as comment_article_audit_status,
       b.view_count                                                      as comment_article_view_count,
       b.comment_count                                                   as comment_article_comment_count,
       b.like_count                                                      as comment_article_like_count,
       b.favorite_count                                                  as comment_article_favorite_count,
       b.nickname                                                        as comment_article_nickname,
       b.is_top                                                          as comment_article_is_top,
       b.top_time                                                        as comment_article_top_time,
       b.score                                                           as comment_article_score,
       c.global_comment_id                                               as reply_id,
       c.business_id                                                     as reply_business_id,
       c.ctime                                                           as reply_ctime,
       c.mtime                                                           as reply_mtime,
       c.is_deleted                                                      as reply_is_deleted,
       c.relation_business_id                                            as reply_relation_business_id,
       c.relation_business_type                                          as reply_relation_business_type,
       c.target_user_account                                             as reply_target_user_account,
       c.target_work_code                                                as reply_target_work_code,
       c.source_user_account                                             as reply_source_user_account,
       c.source_work_code                                                as reply_source_work_code,
       c.parent_business_id                                              as reply_parent_business_id,
       c.content                                                         as reply_content,
       c.comment_status                                                  as reply_comment_status,
       c.comment_time                                                    as reply_comment_time,
       c.is_publisher                                                    as reply_is_publisher,
       c.is_read                                                         as reply_is_read,
       c.floor_num                                                       as reply_floor_num,
       c.audit_status                                                    as reply_audit_status,
       c.comment_type                                                    as reply_comment_type,
       c.like_count                                                      as reply_like_count,
       c.target_nickname                                                 as reply_target_nickname,
       c.source_nickname                                                 as reply_source_nickname,
       c.is_top                                                          as reply_is_top,
       c.article_business_id                                             as reply_article_business_id,
       c.top_time                                                        as reply_top_time,
       current_timestamp()                                               as create_time,
       date_format(date_add(current_date(), -1), "yyyyMMdd")                log_date
from (
         select global_comment_id,
                business_id,
                ctime,
                mtime,
                is_deleted,
                relation_business_id,
                relation_business_type,
                target_user_account,
                target_work_code,
                source_user_account,
                source_work_code,
                parent_business_id,
                content,
                comment_status,
                comment_time,
                is_publisher,
                is_read,
                floor_num,
                audit_status,
                comment_type,
                like_count,
                target_nickname,
                source_nickname,
                is_top,
                article_business_id,
                top_time
         from dwd.dwd_plnt_comt_info_global_comment_a_d --评论主表
         where log_date = date_format(date_add(current_date(), -1), "yyyyMMdd")
     ) a
         left join
     (
         select global_article_id,
                business_id,
                ctime,
                mtime,
                is_deleted,
                article_status,
                simple_content,
                title,
                publish_time,
                latest_reply_time,
                user_account,
                work_code,
                lock_description,
                audit_status,
                view_count,
                comment_count,
                like_count,
                favorite_count,
                nickname,
                is_top,
                top_time,
                score
         from dwd.dwd_plnt_atcl_info_global_article_a_d --帖子表
         where log_date = date_format(date_add(current_date(), -1), "yyyyMMdd")
     ) b on a.article_business_id = b.business_id and a.relation_business_type = 0
         left join
     (
         select global_comment_id,
                business_id,
                ctime,
                mtime,
                is_deleted,
                relation_business_id,
                relation_business_type,
                target_user_account,
                target_work_code,
                source_user_account,
                source_work_code,
                parent_business_id,
                content,
                comment_status,
                comment_time,
                is_publisher,
                is_read,
                floor_num,
                audit_status,
                comment_type,
                like_count,
                target_nickname,
                source_nickname,
                is_top,
                article_business_id,
                top_time
         from dwd.dwd_plnt_comt_info_global_comment_reply_a_d --评论回复表
         where log_date = date_format(date_add(current_date (), -1), "yyyyMMdd")
     ) c
     on a.business_id = c.relation_business_id and a.relation_business_type = 2
         left join
     (
         select pitcure_business_id
         from b_dwb.dwb_plnt_atcl_info_pic_plnt_atcl_a_d
         where log_date = date_format(date_add(current_date(), -1), "yyyyMMdd")
     ) d
     on a.relation_business_id = d.pitcure_business_id

3. 车联网业务

相比上述两种业务,车辆网的业务没有那么重的业务属性,这个场景的宽表更多的是弱业务性强工具性的宽表,比如下面的例子:

set hive.execution.engine = spark;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwb.dwb_plt_voyance_status_acc_1d_d partition (dt)
select car_time,
       carno,
       workoff,
       sum(if(diff_workoff >= 0, diff_workoff, 0)) over (partition by carno order by car_time )       acc_workoff,
       workon,
       sum(if(diff_workon >= 0, diff_workon, 0)) over (partition by carno order by car_time )         acc_workon,
       manual,
       sum(if(diff_manual >= 0, diff_manual, 0)) over (partition by carno order by car_time )         acc_manual, -- 累计里程
       autonomous,
       sum(if(diff_autonomous >= 0, diff_autonomous, 0)) over (partition by carno order by car_time ) acc_autonomous,
       remotecontrol,
       sum(if(diff_remotecontrol >= 0, diff_remotecontrol, 0))
           over (partition by carno order by car_time )                                               acc_remotecontrol,
       date_add(`current_date`(), -1)                                                                 dt
from (
         select car_time,
                carno,
                manual,
                lead_manual - manual as            diff_manual, --当前里程和上一条里程差分
                autonomous,
                lead_autonomous - autonomous       diff_autonomous,
                workon,
                lead_workon - workon               diff_workon,
                workoff,
                lead_workoff - workoff             diff_workoff,
                remotecontrol,
                lead_remotecontrol - remotecontrol diff_remotecontrol
         from (
                  select car_time,
                         carno,
                         nvl(manual, 0)                                                      manual,
                         lead(nvl(manual, 0), 1) over (partition by carno order by car_time) lead_manual,----手动里程整体上移一行
                         autonomous,
                         lead(autonomous, 1) over (partition by carno order by car_time)     lead_autonomous,
                         workon,
                         lead(workon, 1) over (partition by carno order by car_time)         lead_workon,
                         workoff,
                         lead(workoff, 1) over (partition by carno order by car_time)        lead_workoff,
                         remotecontrol,
                         lead(remotecontrol, 1) over (partition by carno order by car_time)  lead_remotecontrol
                  from (
                           select car_time,
                                  carno,
                                  cast(work_on_mile as double)  workon,
                                  cast(work_off_mile as double) workoff,
                                  cast(manual_mile as double)   manual, --手动里程
                                  cast(auto_mile as double)     autonomous,
                                  cast(remote_mile as double)   remotecontrol
                           from dwd.dwd_plt_vehicle_voyance_status_info_1d_d
                           where dt = date_add(`current_date`(), -1)
                             and substring(car_time, 1, 10) = date_add(`current_date`(), -1)
                             and carno is not null
                       ) l1
              ) l2
     ) l3;

这里的手动瞬时里程manual在车端上传时不是累计增长的,当车辆重新上电后里程会重新增长,为了便于后续DWS层聚合,这里将明细的里程制作为了累计增长的里程,这种宽表对业务同样重要。

三、总结

宽表设计是数仓开发的灵魂,具体怎么使用要根据业务制定和取舍,扬长避短发挥业务价值。

<think>嗯,用户现在问的是关于维度建模理论在搭建中的应用,特别是如何从业务过程出发,拆解成据域,再构建DWD、DWS、ADS层的据。之前用户已经了解了星型模型和雪花模型的区别,现在可能是在实际应用中遇到了如何分层的问题,或者想深入理解的分层架构。 首先,我需要确认用户是否了解的基本分层概念,比如ODS、DWD、DWS、ADS。从问题来看,用户似乎已经有一定的了解,但可能需要更详细的流程解释。用户提到的步骤是从业务过程开始,拆解成据域,每个据域中的细粒度流程提取维度和事实字段作为DWD层,然后聚合生成DWS表,最后到ADS层。这可能涉及到Kimball的维度建模方法论,用户想确认自己的理解是否正确。 接下来,我需要思考如何结构化地回答这个问题,确保符合之前用户提供的系统级指令,比如使用中文、行内公式用$...$,独立公式用$$...$$,结构清晰。同时,用户希望得到逐步的解析,帮助其理解整个流程是否正确,以及各层之间的关系。 可能需要分步骤解释每个层级的作用,比如DWD层是明细据,DWS层是轻度汇总,ADS层是应用层。同时,用户提到DWS的设计源于ADS的需求,这点需要强调,说明设计是自上而下还是自下而上,或者两者的结合。 还要注意是否要补充常见的误区,比如是否所有指标都直接来自DWD的聚合,或者是否需要考虑中间层的设计。此外,是否需要举例说明,比如电商中的下单业务过程如何拆解到各层,以帮助用户更好地理解。 另外,用户的问题中提到了据域的划分,这部分可能需要详细说明如何根据业务过程来划分据域,每个据域包含哪些业务流程,以及如何识别维度和事实字段。例如,在电商领域,据域可能分为用户、商品、交易等,每个域下的业务过程如下单、支付、退换货等。 还需要验证用户提到的流程是否正确,比如DWD层是否确实处理最细粒度的据,DWS是否聚合,ADS是否面向应用。同时,是否需要指出可能的设计考虑,比如维度的规范化程度(星型或雪花模型)在DWD层的处理,以及如何影响上层聚合。 最后,要确保回答结构清晰,按照用户提供的步骤逐一解析,并强调各层之间的关系和设计原则,确保用户能够理解整个维度建模的流程和每个步骤的目的。</think>### **维度建模理论在分层中的应用流程解析** 在数据仓库(Data Warehouse)设计中,基于维度建模理论的分层架构(如DWD、DWS、ADS)是核心方法论。以下结合您的描述,逐步拆解其设计逻辑与实现过程: --- #### **一、总体分层逻辑** ```text 业务过程 → 据域划分 → DWD(明细层) → DWS(汇总层) → ADS(应用层) ``` **核心思想**:自下而上逐层聚合,同时结合自上而下的需求驱动设计(如DWS层源于ADS需求)。 --- #### **二、分层设计步骤详解** ##### **1. 从业务过程出发,划分据域** - **业务过程定义**:明确企业核心业务活动(如电商的“下单”“支付”“退款”)。 - **据域划分**:将相关业务过程归类为逻辑据域(如“交易域”“用户域”“商品域”)。 - **示例**: - 交易域:包含下单、支付、退款等业务过程 - 用户域:包含注册、登录、信息更新等业务过程 ##### **2. 构建DWD层(明细据层)** - **目标**:存储最细粒度的原子据,保留原始业务过程细节。 - **设计步骤**: 1. **拆解业务过程**: - 每个业务过程对应一个事实表(如 `dwd_order_detail`)。 2. **识别字段类型**: - **维度字段**:描述业务上下文(如用户ID、商品ID、时间)。 - **事实字段**:可度量的值(如订单金额、量)。 3. **建模方式**: - 使用 **星型模型** 或 **雪花模型**(取决于规范化需求)。 - **示例(下单业务过程)**: ```sql -- DWD层表结构 CREATE TABLE dwd_order_detail ( order_id STRING COMMENT '订单ID', user_id STRING COMMENT '用户ID', product_id STRING COMMENT '商品ID', order_time TIMESTAMP COMMENT '下单时间', amount DECIMAL(10,2) COMMENT '订单金额', -- 事实字段 quantity INT COMMENT '购买量' -- 事实字段 ) PARTITIONED BY (dt STRING); ``` ##### **3. 设计DWS层(汇总层)** - **目标**:基于业务需求,对DWD层据进行轻度聚合,形成主题表。 - **设计原则**: - **需求驱动**:根据ADS层指标需求反向设计DWS表。 - **预聚合**:按常用维度(如时间、地域、类目)提前汇总,减少重复计算。 - **表化**:通过JOIN维度表生成大表(如用户画像+交易行为)。 - **示例(用户交易表)**: ```sql -- DWS层表:用户每日交易汇总 CREATE TABLE dws_user_trade_daily ( user_id STRING COMMENT '用户ID', dt STRING COMMENT '日期', total_amount DECIMAL(10,2) COMMENT '当日累计交易金额', order_count INT COMMENT '当日订单', avg_amount DECIMAL(10,2) COMMENT '笔均金额' ) PARTITIONED BY (dt STRING); -- 据加工逻辑(通过DWD层聚合) INSERT INTO dws_user_trade_daily SELECT user_id, dt, SUM(amount) AS total_amount, COUNT(order_id) AS order_count, SUM(amount) / COUNT(order_id) AS avg_amount FROM dwd_order_detail GROUP BY user_id, dt; ``` ##### **4. 生成ADS层(应用层)** - **目标**:直接面向业务场景,提供高度聚合的指标据。 - **设计要点**: - **指标导向**:按报表、看板需求定义指标(如GMV、DAU)。 - **灵活计算**:基于DWS层进一步聚合或跨主题关联。 - **示例(广告屏实时GMV)**: ```sql -- ADS层表:全站每日GMV CREATE TABLE ads_gmv_daily ( dt STRING COMMENT '日期', gmv DECIMAL(15,2) COMMENT '当日GMV' ); -- 据加工逻辑(从DWS层汇总) INSERT INTO ads_gmv_daily SELECT dt, SUM(total_amount) AS gmv FROM dws_user_trade_daily GROUP BY dt; ``` --- #### **三、关键设计原则** | **层级** | **核心原则** | **技术考量** | |----------|------------------------------------------|---------------------------------------| | **DWD** | 据完整性、细粒度保留 | 分区优化存储,保证高效抽取 | | **DWS** | 按主题预聚合、表化 | 物化视图、冗余维度字段加速查询 | | **ADS** | 指标可解释性、查询高性能 | 结果表缓存、预计算复杂指标 | --- #### **四、常见误区与优化建议** 1. **过度聚合**: - **问题**:在DWS层过早聚合所有维度,导致灵活性丧失。 - **解决**:仅对高频查询的维度预聚合,保留部分明细能力。 2. **DWS与ADS边界模糊**: - **问题**:将业务逻辑同时写在DWS和ADS层,导致链路冗余。 - **解决**:DWS层聚焦通用汇总(如按日统计),ADS层处理业务定制逻辑(如同比环比)。 3. **维度管理混乱**: - **问题**:未统一维度字段口径(如“用户ID”在DWD/DWS层定义不一致)。 - **解决**:建立维度总线矩阵,确保跨层一致性。 --- #### **五、总结** - **DWD层**是的基石,需严格保证据粒度与业务过程对齐。 - **DWS层**是性能与灵活性的平衡点,需结合业务需求反向设计。 - **ADS层**直接驱动决策,需高度抽象且易于理解。 - **核心价值**:通过分层设计,实现 **据复用性、计算效率、业务扩展性** 的平衡。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值