十七、分段维度
本节说明分段维度的实现技术。分段维度包含连续值的分段。例如,年度销售订单分段维度可能包含有叫做“低”、“中”、“高”的三档;各档定义分别为0.01到15000、15000.01到30000.00、30000.01到99999999.99。如果一个客户的年度销售订单金额为10000,则被归为“低”档。
分段维度可以存储多个分段集合。例如,可能有一个用于促销分析的分段集合,另一个用于市场细分,可能还有一个用于销售区域计划。分段一般由用户定义,而且很少能从交易源数据直接获得。
1. 年度销售订单星型模式
本小节说明如何实现一个年度订单分段维度。需要两个新的星型模式,如下图所示。星型模式的事实表使用(关联到)已有的customer_dim和一个新的year_dim表。年维度是日期维度的子集。annual_customer_segment_fact是唯一用到annual_order_segment_dim表的表。annual_order_segement_dim是分段维度表。
annual_order_segment_dim表存储多个分段集合。在下面的例子里将两个分段集合“project alpha”和“grid”导入annual_order_segment_dim表。这两种分段集合都是按照用户的年度销售订单金额将其分类。project alpha分六段,grid分三段。下表显示了这个分段的例子。
Segment Name |
Band Name |
Start Value |
End Value |
PROJECT ALPHA |
Bottom |
0.01 |
2500.00 |
PROJECT ALPHA |
Low |
2500.01 |
3000.00 |
PROJECT ALPHA |
Mid-low |
3000.01 |
4000.00 |
PROJECT ALPHA |
Mid |
4000.01 |
5500.00 |
PROJECT ALPHA |
Mid-high |
5500.01 |
6500.00 |
PROJECT ALPHA |
Top |
6500.01 |
99999999.99 |
Grid |
LOW |
0.01 |
3000.00 |
Grid |
MED |
3000.01 |
6000.00 |
Grid |
HIGH |
6000.01 |
99999999.99 |
每一分段有一个开始值和一个结束值。 分段的粒度就是本段和下段之间的间隙。粒度必须是度量的最小可能值,在销售订单金额的示例中是0.01。最后一个分段的结束值是销售订单金额可能的最大值。下面的脚本用于建立分段维度数据仓库模式。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
| 1use dw;
2create table annual_order_segment_dim (
3 segment_sk int,
4 segment_name varchar(30),
5 band_name varchar(50),
6 band_start_amount decimal(10,2),
7 band_end_amount decimal(10,2),
8 version int,
9 effective_date date,
10 expiry_date date
11)
12clustered by (segment_sk) into 8 buckets
13stored as orc tblproperties ('transactional'='true');
14
15insert into annual_order_segment_dim values (1, 'project alpha', 'bottom', 0.01, 2500.00, 1, '1900-01-01', '2200-01-01');
16insert into annual_order_segment_dim values (2, 'project alpha', 'low', 2500.01, 3000.00, 1, '1900-01-01', '2200-01-01');
17insert into annual_order_segment_dim values (3, 'project alpha', 'mid-low', 3000.01, 4000.00, 1, '1900-01-01', '2200-01-01');
18insert into annual_order_segment_dim values (4, 'project alpha', 'mid', 4000.01, 5500.00, 1, '1900-01-01', '2200-01-01');
19insert into annual_order_segment_dim values (5, 'project alpha', 'mid_high', 5500.01, 6500.00, 1, '1900-01-01', '2200-01-01');
20insert into annual_order_segment_dim values (6, 'project alpha', 'top', 6500.01, 99999999.99, 1, ' 1900-01-01', '2200-01-01');
21insert into annual_order_segment_dim values (7, 'grid', 'low', 0.01, 3000, 1, '1900-01-01', '2200-01-01');
22insert into annual_order_segment_dim values (8, 'grid', 'med', 3000.01, 6000.00, 1, ' 1900-01-01', '2200-01-01');
23insert into annual_order_segment_dim values (9, 'grid', 'high', 6000.01, 99999999.99, 1, '1900-01-01', '2200-01-01');
24
25create table year_dim (
26 year_sk int,
27 year int
28);
29
30create table annual_sales_order_fact (
31 customer_sk int,
32 year_sk int,
33 annual_order_amount decimal(10, 2)
34);
35
36create table annual_customer_segment_fact (
37 segment_sk int,
38 customer_sk int,
39 year_sk int
40);
41 |
本小节说明初始装载并进行测试。下面的初始装载脚本将order_date维度表(date_dim表的一个视图)里的数据导入year_dim表,将sales_order_fact表里的数据导入annual_sales_order_fact表,将annual_sales_order_fact表里的数据导入annual_customer_segment_fact表。此脚本装载所有历史数据。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
| 1use dw;
2
3insert into year_dim
4select row_number() over (order by t1.year) + t2.sk_max, year
5 from (select distinct year year from order_date_dim) t1
6 cross join (select coalesce(max(year_sk),0) sk_max from year_dim) t2;
7
8insert into annual_sales_order_fact
9select a.customer_sk,
10 year_sk,
11 sum(order_amount)
12 from sales_order_fact a,
13 year_dim c,
14 order_date_dim d
15 where a.order_date_sk = d.order_date_sk
16 and c.year = d.year
17 and d.year < 2017
18 group by a.customer_sk, c.year_sk;
19
20insert into annual_customer_segment_fact
21select d.segment_sk,
22 a.customer_sk,
23 a.year_sk
24 from annual_sales_order_fact a,
25 annual_order_segment_dim d
26 where annual_order_amount >= band_start_amount
27 and annual_order_amount <= band_end_amount;
28 |
1 2
| 1 执行初始装载脚本,查询annual_customer_segment_fact表确认初始装载是成功的。
2 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
| 1select a.customer_sk csk,
2 a.year_sk ysk,
3 annual_order_amount amt,
4 segment_name sn,
5 band_name bn
6 from annual_customer_segment_fact a,
7 annual_order_segment_dim b,
8 year_dim c,
9 annual_sales_order_fact d
10 where a.segment_sk = b.segment_sk
11 and a.year_sk = c.year_sk
12 and a.customer_sk = d.customer_sk
13 and a.year_sk = d.year_sk
14cluster by csk, ysk, sn, bn;
15 |
注意,这里是按客户代理键customer_sk分组求和来判断分段,实际情况可能是以customer_number进行分组的,因为无论客户的scd属性如何变化,一般还是认为是一个客户。
3. 定期装载
本小节说明定期装载脚本和如何测试它。除了无需装载year_dim表以外,定期装载与初始装载类似。annual_sales_order_fact表里的数据被导入annual_customer_segment_fact表。每年调度执行下面的定期装载脚本,此脚本装载前一年的销售数据。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
| 1use dw;
2
3insert into annual_sales_order_fact
4select a.customer_sk,
5 year_sk,
6 sum(order_amount)
7 from sales_order_fact a,
8 year_dim c,
9 order_date_dim d
10 where a.order_date_sk = d.order_date_sk
11 and c.year = d.year
12 and d.year = year(current_date) - 1
13 group by a.customer_sk, c.year_sk;
14
15insert into annual_customer_segment_fact
16select d.segment_sk,
17 a.customer_sk,
18 c.year_sk
19 from annual_sales_order_fact a,
20 year_dim c,
21 annual_order_segment_dim d
22 where a.year_sk = c.year_sk
23 and c.year = year(current_date) - 1
24 and annual_order_amount >= band_start_amount
25 and annual_order_amount <= band_end_amount;
26 |