CUBE Keyword in Apache Hive


By Rajat Venkatesh. Published June 19, 2015. Updated July 13, 2018.

Introduction

As part of a recent project, I had to experiment with CUBE functionality in Hive. This functionality was added relatively recently (version 0.10) and is an advanced use case, so it is difficult to find examples other than the one in the Hive wiki. In this post I document some of my experiments in setting up a CUBE on TPCDS. I hope this is useful for other users new to Hive and/or cubes in Hive.

Data Model

For my experiments I used the 500 GB scale TPCDS data set, a 10-node Hadoop cluster, and Hive 0.13.1 running on Qubole Data Service (QDS).

My goal was to calculate various measures on store sales. More specifically, I wanted to calculate:

  • Total Extended Price
  • Total Sales Price
  • Total Net Profit
  • Total Wholesale Cost
  • Total Coupon Amt
  • Total List Price

These measures then need to be broken down by many dimensions. For example, we can drill down along the following dimensions (levels in parentheses):

  • Date (Year, Quarter, Month, Day)
  • Store Information (Store Id)
  • Household Demographics (Number of Dependents, Buy Potential)
  • Customer Demographics (Gender, Marital Status, Education Status)
  • Ad Channel (TV, Event, Email)
  • Time (Hour, Minute)

The ER Diagram for the relevant tables is shown below. It’s a classic star schema.

[Figure: ER diagram of the store sales star schema]

Introduction to Cubes

This example is a typical dimensional data model found in OLAP. The data model describes the measures and the dimensions that make the data useful. Cubes are the physical implementation of a dimensional data model: a cube captures the structure of the data model and organizes measures and dimensions in an optimal layout. Queries on cubes are highly efficient and can support online applications and dashboards.

Build the Cube

Preprocess the data
First, I filtered the store_sales table to contain data from 2002 onwards to keep execution times reasonable for my experiments.

-- 2452276 is the id in date_dim for the row of Jan 1 2002
create table store_sales_2002_plus as
select * from tpcds_orc_500.store_sales where ss_sold_date_sk >= 2452276;

select count(*) from store_sales_2002_plus;
278035965

Create Cube

I created a cube to store the dimensions and measures I am interested in.

create table store_sales_cube as select sum(ss_ext_sales_price) as sum_extended_price, 
             sum(ss_sales_price) as sum_sales_price, sum(ss_net_profit) as sum_net_profit,
             sum(ss_wholesale_cost) as sum_wholesale_cost, sum(ss_coupon_amt) as sum_coupon_amt,
             sum(ss_list_price) as sum_list_price, 
             d_year, d_qoy, d_moy, d_date, s_store_id,
             cd_gender, cd_marital_status, cd_education_status, grouping__id
      from store_sales_2002_plus join item on ss_item_sk = i_item_sk 
             join customer on ss_customer_sk = c_customer_sk 
             join date_dim on ss_sold_date_sk = d_date_sk 
             join customer_demographics on ss_cdemo_sk = cd_demo_sk 
             join promotion on ss_promo_sk = p_promo_sk 
             join household_demographics on ss_hdemo_sk = hd_demo_sk 
             join store on ss_store_sk = s_store_sk 
             join time_dim on ss_sold_time_sk = t_time_sk
        group by d_year, d_qoy, d_moy, d_date, s_store_id,
             cd_gender, cd_marital_status, cd_education_status
        with cube;

select count(*) from store_sales_cube;
1586304

The above query generates aggregates for all possible combinations of group by columns.
Schema of store_sales_cube is:

Column                Data Type
sum_extended_price    double
sum_sales_price       double
sum_net_profit        double
sum_wholesale_cost    double
sum_coupon_amt        double
sum_list_price        double
d_year                int
d_qoy                 int
d_moy                 int
d_date                timestamp
s_store_id            string
cd_gender             string
cd_marital_status     string
cd_education_status   string
grouping__id          string
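The row count above comes from CUBE aggregating over every subset of the group by columns, including the empty set (the grand total). A minimal Python sketch (illustrative only, not part of the Hive workflow) that enumerates the grouping sets for the eight dimensions in the cube-creation query:

```python
from itertools import combinations

# The eight dimension columns from the cube-creation query above
dims = ["d_year", "d_qoy", "d_moy", "d_date", "s_store_id",
        "cd_gender", "cd_marital_status", "cd_education_status"]

# WITH CUBE aggregates over every subset of the group by columns,
# including the empty set (the grand total).
grouping_sets = [s for r in range(len(dims) + 1)
                 for s in combinations(dims, r)]

print(len(grouping_sets))  # 2^8 = 256 grouping sets
```

Each of the 256 grouping sets contributes one row per distinct combination of its column values, which is why the cube holds far more rows than a single GROUP BY would produce.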

A few example rows are shown below.

d_year | d_qoy | d_moy | d_date | s_store_id | s_store_name | cd_gender | cd_marital_status | cd_education_status | grouping__id | Total Sales Price
2002   | 3     | NULL  | NULL   | NULL       | NULL         | NULL      | NULL              | NULL                | 3            | 2.712...E9
NULL   | NULL  | NULL  | NULL   | NULL       | NULL         | NULL      | NULL              | NULL                | 0            | 1.00...E10
NULL   | NULL  | NULL  | NULL   | NULL       | NULL         | M         | NULL              | NULL                | 64           | 5.01..E9

The first row stores measures for d_year=2002, d_qoy=3 only. It is one of the rows in the result of

select d_year, d_qoy, sum(ss_sales_price) -- plus the other aggregates
from store_sales_2002_plus
join date_dim on ss_sold_date_sk = d_date_sk
group by d_year, d_qoy;

The second row stores the measures for the complete data set.

The third row stores the measures for cd_gender=M only. It is one of the rows in the result of

select cd_gender, sum(ss_sales_price) -- plus the other aggregates
from store_sales_2002_plus
join customer_demographics on ss_cdemo_sk = cd_demo_sk
group by cd_gender;

Grouping ID

Let's say an analyst is interested in finding sum_sales_price by gender (cd_gender). How does the analyst find the rows that store the measures for cd_gender?
grouping__id is useful to select rows based on the dimensions of interest. It is a virtual column generated by Hive when the CUBE keyword is used; I included it in the select list so it can be used in subsequent queries. GROUPING__ID is a bit vector over the dimensions in the cube, stored as a base-10 integer. The bits list the dimensions from right to left in the same order as the group by columns in the cube-creation SQL, and a dimension's bit is set to 1 in rows where that dimension is present (not aggregated away).
For rows that hold measures for cd_gender, the bit vector is 001000000. The table below has a couple more examples.

Bit order (left to right): cd_education_status, cd_marital_status, cd_gender, s_store_name, s_store_id, d_date, d_moy, d_qoy, d_year

Group By Columns                    Bit Vector   grouping__id
cd_gender                           001000000    64
d_year, d_qoy                       000000011    3
d_year, d_moy                       000000101    5
d_year, d_qoy, cd_marital_status    010000011    131
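To make the bit-vector rule concrete, here is a small Python sketch (a hypothetical helper, not Hive code) that recomputes the grouping__id values in the table above from the dimension order used in this post:

```python
# Dimension order, least-significant bit first, matching the
# right-to-left order of the group by columns in this post.
DIMS = ["d_year", "d_qoy", "d_moy", "d_date", "s_store_id",
        "s_store_name", "cd_gender", "cd_marital_status",
        "cd_education_status"]

def grouping_id(present):
    """Set bit i to 1 when DIMS[i] is present (not aggregated away)."""
    return sum(1 << DIMS.index(d) for d in present)

print(grouping_id(["cd_gender"]))                             # 64
print(grouping_id(["d_year", "d_qoy"]))                       # 3
print(grouping_id(["d_year", "d_moy"]))                       # 5
print(grouping_id(["d_year", "d_qoy", "cd_marital_status"]))  # 131
```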

Let's look at an example of rows for the cd_gender dimension.

select cd_gender, sum_sales_price
   from store_sales_cube where 
   `grouping__id` = conv("001000000", 2, 10);
cd_gender   total_sales_price
M           5.017321159230397E9
F           5.01904028465792E9

conv is a Hive function that converts a number, given as a string, from one base to another. It takes three arguments: the string, the base of the number in the string, and the base of the result. Here it converts the binary string "001000000" to the decimal value 64.
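As a quick sanity check (plain Python arithmetic, assuming the values above), the two gender cells should add up to roughly the grand-total cell (grouping__id = 0) shown earlier:

```python
# Sum of sales price for the M and F rows from the cube query above
m = 5.017321159230397e9
f = 5.01904028465792e9

total = m + f
print(f"{total:.4e}")  # ~1.0036e10, consistent with the 1.00...E10 grand total
```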

Total Sales Price for each quarter

Let us look at queries on the raw data and on the cube to calculate the measures. I will use the number of bytes read as a measure of work done.

Query on raw data:
select d_year, d_qoy, sum(ss_sales_price)
from store_sales_2002_plus
join date_dim on ss_sold_date_sk = d_date_sk
group by d_year, d_qoy;

Bytes Read: 776,170,833

Query on cube:
select d_year, d_qoy, sum(sum_sales_price)
   from store_sales_cube where 
   `grouping__id` = conv("000000011", 2, 10);
Bytes Read: 11,783,322

The query on the cube scanned about 1.5% of the data read by the query on the raw tables.
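The 1.5% figure follows directly from the two byte counts (a trivial check in Python):

```python
raw_bytes = 776_170_833   # bytes read by the query on the raw tables
cube_bytes = 11_783_322   # bytes read by the query on the cube

print(f"{cube_bytes / raw_bytes:.1%}")  # 1.5%
```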

Total Sales for each quarter to married customers

Let's look at another example. The following query filters the results by another dimension, cd_marital_status:

select d_year, d_qoy, cd_marital_status, sum_sales_price
   from store_sales_cube where 
   `grouping__id` = conv("010000011", 2, 10) and cd_marital_status = "M";

GROUPING Functions in other databases

Grouping functions are important for choosing the right cells in a cube. Other databases have similar functions; for example, see the GROUPING functions in Oracle or GROUPING_ID in SQL Server.

Summary

In summary, we looked at an example of multidimensional data generated by the CUBE keyword in Apache Hive, and saw how to use GROUPING__ID to select the right cells in a cube.

 
