HIVE SQL：工作中常用HSQL语句大全二

释放双眼，带上耳机，听听看~！

1、 hive书写顺序 — 1）hive书写顺序


1
select ……from ……where ……group by……having……order by……limit

— 2）sql执行顺序（having –过滤分组）


1
from ……where ……group by ……having ……select ……order by……limit

2、连表，条件放在on后边还是放在where后边的？ 对于left join，不管on后面跟什么条件，左表的数据全部查出来，因此要想过滤需把条件放到where后面。对于inner join，满足on后面的条件表的数据才能查出，可以起到过滤作用，也可以把条件放到where后面。 –连表查询，把条件放在where后面最稳妥3、处理日期数据和时间戳 秒级别 10位正整数；毫秒级别 13位正整数（1）10位正整数秒级别 -转日期函数：from_unixtime(bigint unixtime[, string format]) — 返回值: string


1
from_unixtime(create_time,'yyyy/MM/dd'), -- create_time 是10位毫秒级别，获取年月日



1
from_unixtime(create_time,'yyyy/MM/dd HH:mm:ss'), -- 获取年月日时分秒from_unixtime(create_time,'yyyy') -- 获取年



1
from_unixtime(1614157840,'MM') -- 获取月from_unixtime(create_time,'HH'), -- 获取小时



1
from_unixtime(create_time,'yyyy-MM-dd HH:mm:ss'), --获取年月日时分秒，只是展示格式有变化hour(from_unixtime(create_time,'yyyy-MM-dd HH:mm:ss')) -- 日期函数hour等要求日期格式是‘-’分割

（2）13位正整数毫秒级别时间戳 — time是13位毫秒级别时间戳格式，先截取到10位秒级别然后再转化成日期时间格式


1
from_unixtime(cast(substring(time,1,10) as bigint),'yyyy-MM-dd HH:mm:ss'),



1
from_unixtime(cast(substring(time,1,10) as bigint),'HH'),



1
from_unixtime(cast(substring(time,1,10) as bigint),'yyyyMMdd') ,

— 判断工作日和周末“ –0周日、1周一、2周二、3周三、4周四、5周五、6周六；pmod取余


1
case



1
when pmod(datediff(from_unixtime(cast(substring(time,1,10) as bigint),'yyyy-MM-dd'), '2018-01-07'), 7) in(6,0) then '周末'



1
else '工作日'



1
end as weekday_num

（3）求日期差-返回天差异：datediff(结束日期，开始日期) –date_p分区字段，格式bigint，比如20230615


1
datediff(from_unixtime(unix_timestamp(CAST(b.date_p AS string), 'yyyyMMdd')), from_unixtime(unix_timestamp(CAST(a.date_p AS string), 'yyyyMMdd'))) AS diff_date

— UNIX_TIMESTAMP takes only string/date/timestamp types — 返回值: bigint


1
 select datediff(from_unixtime(unix_timestamp('20210919', 'yyyyMMdd')), from_unixtime(unix_timestamp('20210909', 'yyyyMMdd')))

— 计算结果为10，不包含当天 — 20210909 20210909 计算结果为0（4）日期时间转日期函数: to_date(string timestamp) — 返回值: string


1
  to_date('2016-02-16 14:02:03') -- 返回值2016-02-16

（5）日期时间和date_p对比日期时间转时间戳unix_timestamp(‘2020-03-14 14:29:24’)—1584167364 —返回值bigint


1
 cast(from_unixtime(unix_timestamp('2020-03-14 14:29:24'),'yyyyMMdd') as bigint)&gt;=${date_s}

（6）bigint格式日期算周：


1
weekofyear(to_date(from_unixtime(unix_timestamp(CAST(feed_create_time AS string),'yyyymmdd'),'yyyy-mm-dd'))) as week

（7）date_p获取月份


1
substr(date_p,5,2) as month1

（8）date_p获取周，从周一开始算


1
weekofyear(from_unixtime(unix_timestamp(cast(date_p as string), 'yyyyMMdd')))

（9）获取当前时间：


1
current_date()  -- 2023-06-10

4、字符串处理函数 截取：substring(字符串m,第几位开始截取，截取几个字符) 去掉首位空格：trim()

5、转义字符\：— trim(regexp_replace(feed_id,’\”‘,”)) — 替换”为空– +、*、|、\等符号在正则表达式中有相应的不同意义，所以在使用时要进行转义处理，用\\+ 或者[+]都可进行转义。

6、null计算规则之前详细介绍过null和空字符串的区别，传送门：HSQL：Hive中空NULL和空字符串的区别？

hive ‘不等于’不管是用！或者<>符号实现，都会将空值即null过滤掉，此时要用 where （white_level<>’3′ or white_level is null) 或者 where (white_level!=’3′ or white_level is null ) 来保留null 的情况。

7、count(1) 和count(字段)区别： count(1)与count(*)得到的结果一致，包含null值 count(字段)不计算null值 count(null)结果恒为0

8、排名函数 （1）ROW_NUMBER：每一行记录生成一个序号，依次排序且不会重复,1,2,3(和2一样),4,5


1
row_number() over(PARTITION BY orderid ORDER BY datachange_lasttime desc) as rank

（2）rank() over()：over子句中排序字段值相同的序号是一样的，后面字段值不相同的序号将跳过相同的排名号排下一个1,2,2,4 （3）dense_rank() over()：dense_rank函数出现相同排名时，将不跳过相同排名号，rank值紧接上一次的rank值,1,2,2,3

9、uinoin和union all union会自动压缩多个结果集合中的重复结果，而union all则将所有的结果全部显示出来，不管是不是重复

10、分组聚合 （1）with cube的分组组合最全，是各个维度值的笛卡尔（包含null)组合，（2） grouping sets则为自定义维度，根据需要分组即可。 ps：通过grouping sets的使用可以简化SQL，比group by单维度进行union性能更好。（3）hive grouping sets多维度报错如果你的 grouping sets大于等于5个维度，将会报如上的错误；解决办法：在你的hql语句前面加上 set hive.new.job.grouping.set.cardinality=xx;（例如我这里是5个维度，一共32个grouping sets，xx可写为64 )注意：group by中字段与Grouping sets中字段，两者必须保持完全一致。

11、行列转换（1）行转列数据表的记录如下：

对promotion_id进行合并：


1
select product_id, concat_ws('_',collect_set(promotion_id)) as promotion_ids



1
from product_promotion



1
group by product_id

执行结果： 5112 960024_960025_960026_960027（2）列转行


1
select product_id,promotion_id



1
from&nbsp;product_promotion&nbsp;LATERAL&nbsp;VIEW&nbsp;EXPLODE(split(promotion_ids,&nbsp;','))&nbsp;f&nbsp;as&nbsp;promotion_id

12、不常用的常用函数 (1） /: 为实数除，其运算结果为实型。如: 5 / 2 = 2.5，5.0 / 2.0 = 2.5，注意，4 / 2 = 2.0，而不等于2。 (2）Div :为整除，div 运算只取商的整数部分，它要求除数和被除数均为整型，其运算结果也为整型。如：10 div 2 = 5，10 div 3 = 3，-15 div 4 = -3。 (3） mod: 为求余，只能用于整数运算，结果也为整数。如：10 mod 4 = 2，-17 mod 4 = -1，4 mod (-3) = 1，即 a mod b = a – (a div b) * b。 (4）foor() 小数取整(8.6—->8;8.2—->8) (5）stddev(c.pv) 计算标准差 –比如：计算两组实验用户的人均发文量是否有差异 –需要计算两组实验用户的发文量标准差。注意：未发布用户的发布次数置为0，再计算标准差。

13、聚合函数联合条件使用：


1
nvl(sum(if(event_id='open_click',1,0)),0) as openclick_num



1
nvl(sum(case when event_id='open_click'then 1 else 0 end),0) as openclick_num

14、字典类型提取信息：get_json_object(字段名,’$.字段中的键’)— level是字典dict类型{“value”:5, “reason”: 8, “level_reason”:”图片广告-OCR文本#文本广告#敏感信息#重度色情#信息过少”}– 想获取评级原因：


1
get_json_object(level,'$.level_reason') as level_reason  --评级原因

15、map类型 字段名：params 类型：map<string,string>{“last_upload_time”:”1584694386724″,”first_start”:”1″,”first_launch”:”0″,”launch_type”:”0″,”$launch_source”:”-1\u0007normal\u00070\u00070″} 提取规则：split(字段名,’分隔符’）按照分割符打散成数组：


1
 split(params&#091;'$launch_source'],'\u0007')&#091;0] --提取结果-1

16、array类型 字段名：resident_location_id 类型：array<struct<id:string,value:int>>[{“id”:”1007″,”value”:0},{“id”:”1006″,”value”:0},{“id”:”1005″,”value”:0},{“id”:”1015″,”value”:0}]

提取规则：resident_location_id[3].value

17、lead()over()和lag()over窗口函数 — lag() over()函数：分组排序后，组内后面一条记录减前面一条记录的差，第一条可返回 NULL — lead(date_p,6,0)over() ：返回当前date_p日期后面第6行的date_p对应的值18、版本限制 app_version<‘8.3.0.0′ cast(replace(app_version,’.’,”) as bigint)<8300

19、如果V1为空就用v2补充语句：


1
COALESCE(T v1, T v2, ...)

20、建表：设置动态分区 — 当对hive分区未做设置时，可能会报错：超过了最大的分区数设置 — 动态分区在insert数据的时候,分区字段要放在插入字段的最后面,同时要和指定的分区字段的顺序一致. — 说明：分区内容不要用中文


1
set hive.exec.dynamic.partition=true; 



1
set hive.exec.dynamic.partition.mode=nonstrict; <em>-- 设置动态分区的模式：nonstrict模式表示允许所有的分区字段都可以使用动态分区。</em> 



1
set hive.stats.reliable=false; <em>-- 如果不能够可靠的收集统计信息，则查询失败，缺省是false</em> 



1
set hive.exec.max.dynamic.partitions.pernode=1000; <em>-- 在每个执行MR的节点上，最大可以创建多少个动态分区</em> 



1
insert overwrite table A表 partition(date_p) <em>-- 按照</em>

分区覆盖写入

参考资料：
https://blog.51cto.com/u_15105906/5621807
https://blog.csdn.net/weixin_49582454/article/details/110533283

{{userData.name}}已认证

HIVE SQL：工作中常用HSQL语句大全二

安全运维之道：发现、解决问题的有效闭环

稳定性建设 – 架构优化的关键策略

{{userData.name}}已认证

Related posts:

安全运维之道：发现、解决问题的有效闭环

稳定性建设 – 架构优化的关键策略

Kubernetes之（十三）ConfigMap和Secret

Kubernetes之（二）核⼼技术概念和API对象

故障治理 – 线上故障分析与故障治理指引

2024年十大网络安全事件盘点