基于hadoop生态圈的数据仓库实践 —— 进阶技术(十七)

释放双眼,带上耳机,听听看~!

十七、分段维度
本节说明分段维度的实现技术。分段维度包含连续值的分段。例如,年度销售订单分段维度可能包含有叫做“低”、“中”、“高”的三档;各档定义分别为0.01到15000、15000.01到30000.00、30000.01到99999999.99。如果一个客户的年度销售订单金额为10000,则被归为“低”档。
分段维度可以存储多个分段集合。例如,可能有一个用于促销分析的分段集合,另一个用于市场细分,可能还有一个用于销售区域计划。分段一般由用户定义,而且很少能从交易源数据直接获得。
1. 年度销售订单星型模式
本小节说明如何实现一个年度订单分段维度。需要两个新的星型模式,如下图所示。星型模式的事实表使用(关联到)已有的customer_dim和一个新的year_dim表。年维度是日期维度的子集。annual_customer_segment_fact是唯一用到annual_order_segment_dim表的表。annual_order_segement_dim是分段维度表。
annual_order_segment_dim表存储多个分段集合。在下面的例子里将两个分段集合“project alpha”和“grid”导入annual_order_segment_dim表。这两种分段集合都是按照用户的年度销售订单金额将其分类。project alpha分六段,grid分三段。下表显示了这个分段的例子。

Segment Name Band Name Start Value End Value
 PROJECT ALPHA  Bottom  0.01  2500.00
 PROJECT ALPHA  Low  2500.01  3000.00
 PROJECT ALPHA  Mid-low  3000.01  4000.00
 PROJECT ALPHA  Mid  4000.01  5500.00
 PROJECT ALPHA  Mid-high  5500.01  6500.00
 PROJECT ALPHA  Top  6500.01  99999999.99
 Grid  LOW  0.01  3000.00
 Grid  MED  3000.01  6000.00
 Grid  HIGH  6000.01  99999999.99

1
1

        每一分段有一个开始值和一个结束值。 分段的粒度就是本段和下段之间的间隙。粒度必须是度量的最小可能值,在销售订单金额的示例中是0.01。最后一个分段的结束值是销售订单金额可能的最大值。下面的脚本用于建立分段维度数据仓库模式。


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
1use dw;  
2create table annual_order_segment_dim (  
3    segment_sk int,  
4    segment_name varchar(30),  
5    band_name varchar(50),  
6    band_start_amount decimal(10,2),
7    band_end_amount decimal(10,2),
8    version int,
9    effective_date date,  
10    expiry_date date  
11)
12clustered by (segment_sk) into 8 buckets          
13stored as orc tblproperties ('transactional'='true');
14
15insert into annual_order_segment_dim values  (1, 'project alpha', 'bottom', 0.01, 2500.00, 1, '1900-01-01', '2200-01-01');
16insert into annual_order_segment_dim values  (2, 'project alpha', 'low', 2500.01, 3000.00, 1, '1900-01-01', '2200-01-01');
17insert into annual_order_segment_dim values  (3, 'project alpha', 'mid-low', 3000.01, 4000.00, 1, '1900-01-01', '2200-01-01');
18insert into annual_order_segment_dim values  (4, 'project alpha', 'mid', 4000.01, 5500.00, 1, '1900-01-01', '2200-01-01');
19insert into annual_order_segment_dim values  (5, 'project alpha', 'mid_high', 5500.01, 6500.00, 1, '1900-01-01', '2200-01-01');
20insert into annual_order_segment_dim values  (6, 'project alpha', 'top', 6500.01, 99999999.99, 1, ' 1900-01-01', '2200-01-01');  
21insert into annual_order_segment_dim values  (7, 'grid', 'low', 0.01, 3000, 1, '1900-01-01', '2200-01-01');  
22insert into annual_order_segment_dim values  (8, 'grid', 'med', 3000.01, 6000.00, 1, ' 1900-01-01', '2200-01-01');
23insert into annual_order_segment_dim values  (9, 'grid', 'high', 6000.01, 99999999.99, 1, '1900-01-01', '2200-01-01');  
24
25create table year_dim (  
26    year_sk int,  
27    year int
28);  
29  
30create table annual_sales_order_fact (  
31    customer_sk int,  
32    year_sk int,  
33    annual_order_amount decimal(10, 2)  
34);
35  
36create table annual_customer_segment_fact (  
37    segment_sk int,  
38    customer_sk int,  
39    year_sk int  
40);
41

1
2
1**2. 初始装载**  
2

        本小节说明初始装载并进行测试。下面的初始装载脚本将order_date维度表(date_dim表的一个视图)里的数据导入year_dim表,将sales_order_fact表里的数据导入annual_sales_order_fact表,将annual_sales_order_fact表里的数据导入annual_customer_segment_fact表。此脚本装载所有历史数据。


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
1use dw;  
2  
3insert into year_dim  
4select row_number() over (order by t1.year) + t2.sk_max, year
5  from (select distinct year year from order_date_dim) t1
6 cross join (select coalesce(max(year_sk),0) sk_max from year_dim) t2;
7
8insert into annual_sales_order_fact  
9select a.customer_sk,
10       year_sk,
11       sum(order_amount)  
12  from sales_order_fact a,
13       year_dim c,
14       order_date_dim d  
15 where a.order_date_sk = d.order_date_sk  
16   and c.year = d.year  
17   and d.year < 2017  
18 group by a.customer_sk, c.year_sk;  
19  
20insert into annual_customer_segment_fact  
21select d.segment_sk,
22       a.customer_sk,
23       a.year_sk  
24  from annual_sales_order_fact a,
25       annual_order_segment_dim d  
26 where annual_order_amount >= band_start_amount  
27   and annual_order_amount <= band_end_amount;  
28

1
2
1        执行初始装载脚本,查询annual_customer_segment_fact表确认初始装载是成功的。  
2

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
1select a.customer_sk csk,
2       a.year_sk ysk,
3       annual_order_amount amt,
4       segment_name sn,
5       band_name bn
6  from annual_customer_segment_fact a,
7       annual_order_segment_dim b,
8       year_dim c,
9       annual_sales_order_fact d
10 where a.segment_sk = b.segment_sk
11   and a.year_sk = c.year_sk
12   and a.customer_sk = d.customer_sk
13   and a.year_sk = d.year_sk
14cluster by csk, ysk, sn, bn;
15

1
2
1        查询结果如下图所示。  
2

        注意,这里是按客户代理键customer_sk分组求和来判断分段,实际情况可能是以customer_number进行分组的,因为无论客户的scd属性如何变化,一般还是认为是一个客户。
3. 定期装载
本小节说明定期装载脚本和如何测试它。除了无需装载year_dim表以外,定期装载与初始装载类似。annual_sales_order_fact表里的数据被导入annual_customer_segment_fact表。每年调度执行下面的定期装载脚本,此脚本装载前一年的销售数据。


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
1use dw;  
2
3insert into annual_sales_order_fact  
4select a.customer_sk,
5       year_sk,
6       sum(order_amount)  
7  from sales_order_fact a,
8       year_dim c,
9       order_date_dim d  
10 where a.order_date_sk = d.order_date_sk  
11   and c.year = d.year  
12   and d.year = year(current_date) - 1  
13 group by a.customer_sk, c.year_sk;  
14  
15insert into annual_customer_segment_fact  
16select d.segment_sk,
17       a.customer_sk,
18       c.year_sk  
19  from annual_sales_order_fact a,
20       year_dim c,
21       annual_order_segment_dim d  
22 where a.year_sk = c.year_sk
23   and c.year = year(current_date) - 1
24   and annual_order_amount >= band_start_amount  
25   and annual_order_amount <= band_end_amount;
26

给TA打赏
共{{data.count}}人
人已打赏
安全运维

MySQL和MongoDB数据相互迁移

2021-12-11 11:36:11

安全运维

Ubuntu上NFS的安装配置

2021-12-19 17:36:11

个人中心
购物车
优惠劵
今日签到
有新私信 私信列表
搜索