三、维度子集
有些需求不需要最细节的数据。例如更想要某个月而不是某天的记录。再比如相对于全部的销售数据,可能对某些特定状态的数据更感兴趣等。这些特定维度包含在从细节维度选择的行中,所以叫维度子集。维度子集比细节维度的数据少,因此更易使用,查询也更快。
本节中将准备两个特定维度,它们均取自现有的维度:月份维度(日期维度的子集),Pennsylvania州客户维度(客户维度的子集)。
1. 建立月份维度表
执行下面的脚本建立月份维度表。注意月份维度不包含promo_ind列,该列不适用月层次上,因为一个月中可能有多个促销期,而且并不是一个月中的每一天都是促销期。促销标记适用于天这个层次。
1 2 3 4 5 6 7 8 9 10 11 12 13
| 1USE dw;
2
3CREATE TABLE month_dim (
4 month_sk INT comment 'surrogate key',
5 month tinyint comment 'month',
6 month_name varchar(9) comment 'month name',
7 quarter tinyint comment 'quarter',
8 year smallint comment 'year'
9)
10comment 'month dimension table'
11clustered by (month_sk) into 8 buckets
12stored as orc tblproperties ('transactional'='true') ;
13 |
1 2
| 1 为了从日期维度同步导入月份维度,要把月份装载嵌入到日期维度的预装载脚本中。需要修改“建立数据仓库示例模型”里生成日期维度数据的脚本。下图显示了修改后的date_dim_generate.sh文件内容。
2 |
修改后的create_table_date_dim.sql文件内容如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
| 1-- 设置变量以支持事务
2set hive.support.concurrency=true;
3set hive.exec.dynamic.partition.mode=nonstrict;
4set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
5set hive.compactor.initiator.on=true;
6set hive.compactor.worker.threads=1;
7
8use dw;
9
10-- 首次执行会建立日期维度临时表
11create table if not exists date_dim_tmp (
12 date date comment 'date,yyyy-mm-dd',
13 month tinyint comment 'month',
14 month_name varchar(9) comment 'month name',
15 quarter tinyint comment 'quarter',
16 year smallint comment 'year',
17 promo_ind char(1) comment 'promotion index'
18)
19comment 'date dimension table'
20row format delimited fields terminated by ','
21stored as textfile;
22
23-- 首次执行会建立日期维度表
24create table if not exists date_dim (
25 date_sk int comment 'surrogate key',
26 date date comment 'date,yyyy-mm-dd',
27 month tinyint comment 'month',
28 month_name varchar(9) comment 'month name',
29 quarter tinyint comment 'quarter',
30 year smallint comment 'year',
31 promo_ind char(1) comment 'promotion index'
32)
33comment 'date dimension table'
34clustered by (date_sk) into 8 buckets
35stored as orc tblproperties ('transactional'='true');
36
37-- 首次执行会建立月份维度表
38create table if not exists month_dim (
39 month_sk INT comment 'surrogate key',
40 month tinyint comment 'month',
41 month_name varchar(9) comment 'month name',
42 quarter tinyint comment 'quarter',
43 year smallint comment 'year'
44)
45comment 'month dimension table'
46clustered by (month_sk) into 8 buckets
47stored as orc tblproperties ('transactional'='true') ;
48 |
1 2
| 1 新增的append_date.sql文件内容如下:
2 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
| 1-- 设置变量以支持事务
2set hive.support.concurrency=true;
3set hive.exec.dynamic.partition.mode=nonstrict;
4set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
5set hive.compactor.initiator.on=true;
6set hive.compactor.worker.threads=1;
7
8use dw;
9
10-- 向日期维度表追加数据
11insert into date_dim
12select row_number() over (order by date) + t2.sk_max,
13 t1.date,
14 t1.month,
15 t1.month_name,
16 t1.quarter,
17 t1.year,
18 t1.promo_ind
19 from (select * from date_dim_tmp) t1
20cross join (select coalesce(max(date_sk),0) sk_max from date_dim) t2;
21
22-- 向月份维度表追加数据
23insert into month_dim
24select row_number() over (order by t1.year,t1.month) + t2.sk_max,
25 t1.month,
26 t1.month_name,
27 t1.quarter,
28 t1.year
29 from (select distinct month, month_name, quarter, year FROM date_dim_tmp) t1
30cross join (select coalesce(max(month_sk),0) sk_max from month_dim) t2;
31 |
- 生成CSV文件时去掉了代理键列,增加了促销期标记列。
- 生成date_dim.csv日期数据文件前先调用create_table_date_dim.sql脚本建表,新增一个append_date.sql脚本用于追加数据。
- 先向一个临时表date_dim_tmp中加载数据,然后在append_date.sql中处理从date_dim_tmp到date_dim的装载。之所以这样做有两个原因,一是考虑到后续可能需要追加日期,而不是重新生成所有数据,二是date_dim是一个ORC格式的二进制文件,不能直接从文本文件LOAD数据,只能从一个普通文本文件格式的表插入数据。
无论何时用修改后的脚本增加日期记录时,如果这个日期所在的月份没在月份维度中,那么该月份会被装载到月份维度中。下面测试一下日期和月份维度表数据的预装载。
(1)删除date_dim_tmp、date_dim、month_dim表
1 2 3 4 5
| 1use dw;
2drop table date_dim_tmp;
3drop table date_dim;
4drop table month_dim;
5 |
1 2
| 1(2)执行预装载,生成从2000年1月1日到2010年12月31日的日期数据
2 |
1 2
| 1./date_dim_generate.sh 2000-01-01 2010-12-31
2 |
1 2
| 1 这次执行维度表都是新建的,生成的日期和月份维度数据如下图所示。
2 |
(3)再次执行预装载,生成从2011年1月1日到2020年12月31日的日期数据
1 2
| 1./date_dim_generate.sh 2011-01-01 2020-12-31
2 |
1 2
| 1 这次执行是向已有的维度表中追加日期,生成的日期和月份维度数据如下图所示。
2 |
(4)执行上一节建立的on_demand.sh脚本文件,更新促销标记列,数据恢复原状。
月份维度是一个上卷维度,它包含基本维度的上层数据。而特定维度子集是选择基本维度的一个特定子集。执行下面的脚本建立特定维度表,并导入Pennsylvania (PA)客户维度子集数据。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
| 1USE dw;
2
3CREATE TABLE pa_customer_dim (
4 customer_sk int comment 'surrogate key',
5 customer_number int comment 'number',
6 customer_name varchar(50) comment 'name',
7 customer_street_address varchar(50) comment 'address',
8 customer_zip_code int comment 'zipcode',
9 customer_city varchar(30) comment 'city',
10 customer_state varchar(2) comment 'state',
11 shipping_address varchar(50) comment 'shipping_address',
12 shipping_zip_code int comment 'shipping_zip_code',
13 shipping_city varchar(30) comment 'shipping_city',
14 shipping_state varchar(2) comment 'shipping_state',
15 version int comment 'version',
16 effective_date date comment 'effective date',
17 expiry_date date comment 'expiry date'
18)
19CLUSTERED BY (customer_sk) INTO 8 BUCKETS
20STORED AS ORC TBLPROPERTIES ('transactional'='true');
21 |
1 2
| 1 注意,PA客户维度子集与月份维度子集有两点区别:
2 |
- pa_customer_dim表和customer_dim表有完全相同的列,而month_dim不包含date_dim表的日期列。
- pa_customer_dim表的代理键就是客户维度的代理键,而month_dim表里的月份维度代理键并不来自日期维度。
3. 修改定期装载regular_etl.sql文件
需要增加对PA客户维度的处理,这里只是在装载完customer_dim后简单重载PA客户维度数据,修改后的regular_etl.sql文件内容如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277
| 1-- 设置变量以支持事务
2set hive.support.concurrency=true;
3set hive.exec.dynamic.partition.mode=nonstrict;
4set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
5set hive.compactor.initiator.on=true;
6set hive.compactor.worker.threads=1;
7
8USE dw;
9
10-- 设置SCD的生效时间和过期时间
11SET hivevar:cur_date = CURRENT_DATE();
12SET hivevar:pre_date = DATE_ADD(${hivevar:cur_date},-1);
13SET hivevar:max_date = CAST('2200-01-01' AS DATE);
14
15-- 设置CDC的上限时间
16INSERT OVERWRITE TABLE rds.cdc_time SELECT last_load, ${hivevar:cur_date} FROM rds.cdc_time;
17
18-- 装载customer维度
19-- 设置已删除记录和地址相关列上SCD2的过期,用<=>运算符处理NULL值。
20UPDATE customer_dim
21 SET expiry_date = ${hivevar:pre_date}
22 WHERE customer_dim.customer_sk IN
23(SELECT a.customer_sk
24 FROM (SELECT customer_sk,
25 customer_number,
26 customer_street_address,
27 customer_zip_code,
28 customer_city,
29 customer_state,
30 shipping_address,
31 shipping_zip_code,
32 shipping_city,
33 shipping_state
34 FROM customer_dim WHERE expiry_date = ${hivevar:max_date}) a LEFT JOIN
35 rds.customer b ON a.customer_number = b.customer_number
36 WHERE b.customer_number IS NULL OR
37 ( !(a.customer_street_address <=> b.customer_street_address)
38 OR !(a.customer_zip_code <=> b.customer_zip_code)
39 OR !(a.customer_city <=> b.customer_city)
40 OR !(a.customer_state <=> b.customer_state)
41 OR !(a.shipping_address <=> b.shipping_address)
42 OR !(a.shipping_zip_code <=> b.shipping_zip_code)
43 OR !(a.shipping_city <=> b.shipping_city)
44 OR !(a.shipping_state <=> b.shipping_state)
45 ));
46
47-- 处理customer_street_addresses列上SCD2的新增行
48INSERT INTO customer_dim
49SELECT
50 ROW_NUMBER() OVER (ORDER BY t1.customer_number) + t2.sk_max,
51 t1.customer_number,
52 t1.customer_name,
53 t1.customer_street_address,
54 t1.customer_zip_code,
55 t1.customer_city,
56 t1.customer_state,
57 t1.shipping_address,
58 t1.shipping_zip_code,
59 t1.shipping_city,
60 t1.shipping_state,
61 t1.version,
62 t1.effective_date,
63 t1.expiry_date
64FROM
65(
66SELECT
67 t2.customer_number customer_number,
68 t2.customer_name customer_name,
69 t2.customer_street_address customer_street_address,
70 t2.customer_zip_code customer_zip_code,
71 t2.customer_city customer_city,
72 t2.customer_state customer_state,
73 t2.shipping_address shipping_address,
74 t2.shipping_zip_code shipping_zip_code,
75 t2.shipping_city shipping_city,
76 t2.shipping_state shipping_state,
77 t1.version + 1 version,
78 ${hivevar:pre_date} effective_date,
79 ${hivevar:max_date} expiry_date
80 FROM customer_dim t1
81INNER JOIN rds.customer t2
82 ON t1.customer_number = t2.customer_number
83 AND t1.expiry_date = ${hivevar:pre_date}
84 LEFT JOIN customer_dim t3
85 ON t1.customer_number = t3.customer_number
86 AND t3.expiry_date = ${hivevar:max_date}
87WHERE (!(t1.customer_street_address <=> t2.customer_street_address)
88 OR !(t1.customer_zip_code <=> t2.customer_zip_code)
89 OR !(t1.customer_city <=> t2.customer_city)
90 OR !(t1.customer_state <=> t2.customer_state)
91 OR !(t1.shipping_address <=> t2.shipping_address)
92 OR !(t1.shipping_zip_code <=> t2.shipping_zip_code)
93 OR !(t1.shipping_city <=> t2.shipping_city)
94 OR !(t1.shipping_state <=> t2.shipping_state)
95 )
96 AND t3.customer_sk IS NULL) t1
97CROSS JOIN
98(SELECT COALESCE(MAX(customer_sk),0) sk_max FROM customer_dim) t2;
99
100-- 处理customer_name列上的SCD1
101-- 因为hive的update的set子句还不支持子查询,所以这里使用了一个临时表存储需要更新的记录,用先delete再insert代替update
102-- 因为SCD1本身就不保存历史数据,所以这里更新维度表里的所有customer_name改变的记录,而不是仅仅更新当前版本的记录
103DROP TABLE IF EXISTS tmp;
104CREATE TABLE tmp AS
105SELECT
106 a.customer_sk,
107 a.customer_number,
108 b.customer_name,
109 a.customer_street_address,
110 a.customer_zip_code,
111 a.customer_city,
112 a.customer_state,
113 a.shipping_address,
114 a.shipping_zip_code,
115 a.shipping_city,
116 a.shipping_state,
117 a.version,
118 a.effective_date,
119 a.expiry_date
120 FROM customer_dim a, rds.customer b
121 WHERE a.customer_number = b.customer_number AND !(a.customer_name <=> b.customer_name);
122DELETE FROM customer_dim WHERE customer_dim.customer_sk IN (SELECT customer_sk FROM tmp);
123INSERT INTO customer_dim SELECT * FROM tmp;
124
125-- 处理新增的customer记录
126INSERT INTO customer_dim
127SELECT
128 ROW_NUMBER() OVER (ORDER BY t1.customer_number) + t2.sk_max,
129 t1.customer_number,
130 t1.customer_name,
131 t1.customer_street_address,
132 t1.customer_zip_code,
133 t1.customer_city,
134 t1.customer_state,
135 t1.shipping_address,
136 t1.shipping_zip_code,
137 t1.shipping_city,
138 t1.shipping_state,
139 1,
140 ${hivevar:pre_date},
141 ${hivevar:max_date}
142FROM
143(
144SELECT t1.* FROM rds.customer t1 LEFT JOIN customer_dim t2 ON t1.customer_number = t2.customer_number
145 WHERE t2.customer_sk IS NULL) t1
146CROSS JOIN
147(SELECT COALESCE(MAX(customer_sk),0) sk_max FROM customer_dim) t2;
148
149-- 重载PA客户维度
150TRUNCATE TABLE pa_customer_dim;
151INSERT INTO pa_customer_dim
152SELECT
153 customer_sk
154, customer_number
155, customer_name
156, customer_street_address
157, customer_zip_code
158, customer_city
159, customer_state
160, shipping_address
161, shipping_zip_code
162, shipping_city
163, shipping_state
164, version
165, effective_date
166, expiry_date
167FROM customer_dim
168WHERE customer_state = 'PA' ;
169
170-- 装载product维度
171-- 设置已删除记录和product_name、product_category列上SCD2的过期
172UPDATE product_dim
173 SET expiry_date = ${hivevar:pre_date}
174 WHERE product_dim.product_sk IN
175(SELECT a.product_sk
176 FROM (SELECT product_sk,product_code,product_name,product_category
177 FROM product_dim WHERE expiry_date = ${hivevar:max_date}) a LEFT JOIN
178 rds.product b ON a.product_code = b.product_code
179 WHERE b.product_code IS NULL OR (a.product_name <> b.product_name OR a.product_category <> b.product_category));
180
181-- 处理product_name、product_category列上SCD2的新增行
182INSERT INTO product_dim
183SELECT
184 ROW_NUMBER() OVER (ORDER BY t1.product_code) + t2.sk_max,
185 t1.product_code,
186 t1.product_name,
187 t1.product_category,
188 t1.version,
189 t1.effective_date,
190 t1.expiry_date
191FROM
192(
193SELECT
194 t2.product_code product_code,
195 t2.product_name product_name,
196 t2.product_category product_category,
197 t1.version + 1 version,
198 ${hivevar:pre_date} effective_date,
199 ${hivevar:max_date} expiry_date
200 FROM product_dim t1
201INNER JOIN rds.product t2
202 ON t1.product_code = t2.product_code
203 AND t1.expiry_date = ${hivevar:pre_date}
204 LEFT JOIN product_dim t3
205 ON t1.product_code = t3.product_code
206 AND t3.expiry_date = ${hivevar:max_date}
207WHERE (t1.product_name <> t2.product_name OR t1.product_category <> t2.product_category) AND t3.product_sk IS NULL) t1
208CROSS JOIN
209(SELECT COALESCE(MAX(product_sk),0) sk_max FROM product_dim) t2;
210
211-- 处理新增的product记录
212INSERT INTO product_dim
213SELECT
214 ROW_NUMBER() OVER (ORDER BY t1.product_code) + t2.sk_max,
215 t1.product_code,
216 t1.product_name,
217 t1.product_category,
218 1,
219 ${hivevar:pre_date},
220 ${hivevar:max_date}
221FROM
222(
223SELECT t1.* FROM rds.product t1 LEFT JOIN product_dim t2 ON t1.product_code = t2.product_code
224 WHERE t2.product_sk IS NULL) t1
225CROSS JOIN
226(SELECT COALESCE(MAX(product_sk),0) sk_max FROM product_dim) t2;
227
228-- 装载order维度
229INSERT INTO order_dim
230SELECT
231 ROW_NUMBER() OVER (ORDER BY t1.order_number) + t2.sk_max,
232 t1.order_number,
233 t1.version,
234 t1.effective_date,
235 t1.expiry_date
236 FROM
237(
238SELECT
239 order_number order_number,
240 1 version,
241 order_date effective_date,
242 '2200-01-01' expiry_date
243 FROM rds.sales_order, rds.cdc_time
244 WHERE entry_date >= last_load AND entry_date < current_load ) t1
245CROSS JOIN
246(SELECT COALESCE(MAX(order_sk),0) sk_max FROM order_dim) t2;
247
248-- 装载销售订单事实表
249INSERT INTO sales_order_fact
250SELECT
251 order_sk,
252 customer_sk,
253 product_sk,
254 date_sk,
255 order_amount,
256 order_quantity
257 FROM
258 rds.sales_order a,
259 order_dim b,
260 customer_dim c,
261 product_dim d,
262 date_dim e,
263 rds.cdc_time f
264 WHERE
265 a.order_number = b.order_number
266AND a.customer_number = c.customer_number
267AND a.order_date >= c.effective_date
268AND a.order_date < c.expiry_date
269AND a.product_code = d.product_code
270AND a.order_date >= d.effective_date
271AND a.order_date < d.expiry_date
272AND to_date(a.order_date) = e.date
273AND a.entry_date >= f.last_load AND a.entry_date < f.current_load ;
274
275-- 更新时间戳表的last_load字段
276INSERT OVERWRITE TABLE rds.cdc_time SELECT current_load, current_load FROM rds.cdc_time;
277 |
(1)执行下面的SQL脚本往客户源数据里添加一个PA的客户和四个OH的客户。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
| 1use source;
2insert into customer
3( customer_name, customer_street_address, customer_zip_code,
4 customer_city, customer_state, shipping_address,
5 shipping_zip_code, shipping_city, shipping_state)
6values
7('PA Customer', '1111 Louise Dr.', '17050',
8 'Mechanicsburg', 'PA', '1111 Louise Dr.',
9 '17050', 'Mechanicsburg', 'PA'),
10('Bigger Customers', '7777 Ridge Rd.', '44102',
11 'Cleveland', 'OH', '7777 Ridge Rd.',
12 '44102', 'Cleveland', 'OH'),
13('Smaller Stores', '8888 Jennings Fwy.', '44102',
14 'Cleveland', 'OH', '8888 Jennings Fwy.',
15 '44102', 'Cleveland', 'OH'),
16('Small-Medium Retailers', '9999 Memphis Ave.', '44102',
17 'Cleveland', 'OH', '9999 Memphis Ave.',
18 '44102', 'Cleveland', 'OH'),
19('OH Customer', '6666 Ridge Rd.', '44102',
20 'Cleveland', 'OH', '6666 Ridge Rd.',
21 '44102','Cleveland', 'OH') ;
22
23commit;
24 |
使用下面的命令执行定期装载。
1 2 3
| 1use dw;
2select customer_name, customer_state, effective_date, expiry_date from pa_customer_dim;
3 |
1 2
| 1 结果是pa_customer_dim表增加了19条记录,如下图所示。
2 |
本节通过月份维度和PA客户维度的例子说明了两种典型维度子集的实现。