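CDH Hadoop series index:

Hadoop in Practice (3): Building a Fully Distributed CDH Cluster on Virtual Machines

Hadoop in Practice (4): Hadoop Cluster Management and Resource Allocation

Hadoop in Practice (5): Hadoop Operations Experience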

Hive Architecture

Hive has two server-side daemons. HiveServer2 is a Thrift service that supports JDBC access. MetaStore Server provides access to the metadata database.

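Hive Core Components

Compiler: compiles HQL statements.

Optimizer: optimizes the HQL and produces an optimized execution plan. Run explain select … to view the plan.

Executor: runs the classes produced by the final translation (MapReduce jobs).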

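Hive User Interfaces

There are three main user interfaces: CLI, JDBC/ODBC, and WebGUI.

CLI is the hive shell command line.

JDBC/ODBC is Hive's Java client interface, used much like JDBC with a traditional database.

WebGUI accesses Hive through a browser; it is a deprecated feature.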
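
As a rough illustration of the JDBC route, the sketch below builds the connection URL that a JDBC client such as beeline would pass to HiveServer2. The host cdhslave1 is where this article places HiveServer2. Port 10000 is the usual HiveServer2 default and is assumed here, and the function name and the user name hive are hypothetical.

```shell
# Build a HiveServer2 JDBC URL. Port 10000 is the usual HiveServer2 default
# (an assumption; check the value configured for your cluster).
hs2_jdbc_url() {
    host="$1"
    port="${2:-10000}"
    db="${3:-default}"
    printf 'jdbc:hive2://%s:%s/%s\n' "$host" "$port" "$db"
}

# Print the beeline command instead of running it, so the sketch is safe
# to try on a machine without Hive installed.
echo "beeline -u $(hs2_jdbc_url cdhslave1) -n hive"
```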

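Adding the Hive Service

Add Service - Hive: leave Gateway empty, select cdhmaster for Hive Metastore Server, and select cdhslave1 for HiveServer2. Choose the embedded database and skip the connection test.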

Installing MySQL

yum list | grep mysql
yum install -y mysql-server
# Start the mysql service
chkconfig --list | grep mysql
service mysqld start
chkconfig mysqld on
chkconfig --list | grep mysql
# Set the root administrator password
mysqladmin -u root password 123456
# Log in to mysql
mysql -u root -p
# Set the character set, otherwise encoding problems occur
create database hive;
alter database hive character set latin1;
# Grant access privileges
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;

If the character set is wrong, Hive may report errors such as:

FAILED: Error in metadata: MetaException(message:Got exception: org.apache.thrift.transport.TTransportException null)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

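MySQL driver: place the MySQL driver mysql-connector-java-5.1.18-bin.jar in /opt/cloudera/parcels/CDH/lib/hive/lib/.

(Optional) Copy mysql-connector-java-5.1.18-bin.jar to /usr/share/cmf/lib/ as well, for use by the cm UI. The metastore database configuration was skipped when adding the Hive service because this driver package might not be found there.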
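
Before starting the service, it can help to confirm the driver jar is actually in place. The following is a minimal sketch that checks the two directories named above. The function name has_mysql_driver is hypothetical, and the glob assumes the jar keeps its mysql-connector-java-* file name.

```shell
# Return success if a MySQL Connector/J jar is present in the given directory.
has_mysql_driver() {
    dir="$1"
    for jar in "$dir"/mysql-connector-java-*.jar; do
        # When nothing matches, the unexpanded pattern itself is returned,
        # so test that the path really exists.
        [ -e "$jar" ] && return 0
    done
    return 1
}

# Directories named in this article; report which ones contain the driver.
for d in /opt/cloudera/parcels/CDH/lib/hive/lib /usr/share/cmf/lib; do
    if has_mysql_driver "$d"; then
        echo "driver found in $d"
    else
        echo "driver missing in $d"
    fi
done
```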

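Configuring the Hive Metastore Database

Open the Hive service's Configuration page in cm.

First adjust the resource settings: set Java Heap Size of Hive Metastore Server to 200M and Java Heap Size of HiveServer2 to 200M.

Then set the metastore database options:
Hive Metastore Database Type: MySQL
Hive Metastore Database Name: hive
Hive Metastore Database Host: cdhmaster
Hive Metastore Database Port: 3306
Hive Metastore Database User: root
Hive Metastore Database Password: 123456
Auto Create and Upgrade Hive Metastore Database Schema: checked
Strict Hive Metastore Schema Validation: unchecked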

Next, start the Hive service and check whether the Metastore Server can connect to mysql (open the instance and view the role's log). If it cannot connect, check the grant privileges for accessing mysql.

[main]: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:mysql://cdhmaster:3306/hive?useUnicode=true&characterEncoding=UTF-8, username = root. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'root'@'cdhmaster' (using password: YES)

GRANT ALL PRIVILEGES ON *.* TO 'root'@'cdhmaster' IDENTIFIED BY '123456' WITH GRANT OPTION;
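
The Access denied message names the exact user and host that MySQL rejected, so the matching GRANT can be derived from the log line. The sketch below does that with sed. The function name is hypothetical, the password is left as a placeholder, and the grant is limited to hive.* as an assumption; that is narrower than the *.* grant used in this article.

```shell
# Turn an "Access denied for user 'u'@'h'" log line into a GRANT statement.
# Scope and password are placeholders: this sketch grants only on hive.*,
# which is narrower than the *.* grant used elsewhere in this article.
grant_from_log_line() {
    printf '%s\n' "$1" |
    sed -n "s/.*Access denied for user '\([^']*\)'@'\([^']*\)'.*/GRANT ALL PRIVILEGES ON hive.* TO '\1'@'\2' IDENTIFIED BY '<password>';/p"
}

grant_from_log_line "java.sql.SQLException: Access denied for user 'root'@'cdhmaster' (using password: YES)"
```

Lines that do not contain an Access denied message produce no output.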

Remote Metastore Database

The metastore database can be installed on any node; clients access it through the MetaStore Server service.

(Meta Store Client/Hive CLI)-MetaStore Server(thrift)-MySQL Server

hive.metastore.local: true for a local metastore, false for a remote one
hive.metastore.uris: for a remote metastore, for example thrift://192.168.1.110:9083
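
On the client side these two settings would typically end up in hive-site.xml. The sketch below prints the corresponding property entries for a given Thrift URI. The function name is hypothetical. As far as I know, newer Hive releases ignore hive.metastore.local and rely on hive.metastore.uris alone, so treat the first property as applying to older versions only.

```shell
# Print hive-site.xml properties for a client that uses a remote metastore.
# hive.metastore.local is assumed to matter only on older Hive releases.
remote_metastore_properties() {
    uri="$1"
    cat <<EOF
<property>
  <name>hive.metastore.local</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>$uri</value>
</property>
EOF
}

remote_metastore_properties "thrift://192.168.1.110:9083"
```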

Hive Commands

show databases;
use default;
create table test(id int, name string);
desc test;

Tables
Managed tables (also called internal tables): dropping the table deletes its data.

External tables: create external table tableName; dropping the table does not delete the data.

alter table tableName set location '';
alter table tableName add partition(date='') location '';

The default field delimiter is \001 and the default row delimiter is \n.
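
Because \001 (Ctrl-A) is invisible in a terminal, files in Hive's default text format can look like one unbroken string. The sketch below writes two sample rows with that delimiter and makes them readable with tr. The sample data and the function name show_rows are made up for illustration.

```shell
# Write two rows the way the default text format stores them:
# fields separated by \001 (Ctrl-A), rows separated by \n.
data_file="$(mktemp)"
printf '1\001alice\n2\001bob\n' > "$data_file"

# \001 does not display, so swap it for a comma to inspect the rows.
show_rows() {
    tr '\001' ',' < "$1"
}

show_rows "$data_file"
rm -f "$data_file"
```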


create external table page_view_stg
(userid bigint,
 url string,
 ip string comment 'IP Address of the User')
partitioned by (ds string, type string)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/external/city';

Field Types

int
bigint: long integers
double: monetary amounts
string: strings and dates; anything non-numeric can be stored as string

CLI

hive -e "select …"

hive -f aa.sql

hive -i <file> -e "…": the -i option loads initialization commands from a file, for example UDF registration.
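
A minimal sketch of how these options might be combined in a wrapper script. HIVE_BIN, hive_inline, hive_script and the file names are hypothetical; -e, -f and -i are the options described above.

```shell
# HIVE_BIN is a hypothetical override so the wrappers can be dry-run
# with a stub command instead of the real hive binary.
HIVE_BIN="${HIVE_BIN:-hive}"

# Run one inline statement (-e) after loading an init file (-i),
# for example a file that registers UDFs.
hive_inline() {
    init_file="$1"
    statement="$2"
    "$HIVE_BIN" -i "$init_file" -e "$statement"
}

# Run a script file (-f) after loading the same kind of init file.
hive_script() {
    init_file="$1"
    script_file="$2"
    "$HIVE_BIN" -i "$init_file" -f "$script_file"
}
```

For example, hive_script init.sql aa.sql would run aa.sql after the statements in init.sql. Pointing HIVE_BIN at a stub lets the wrappers be exercised on a machine without Hive.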


create database dw location '/user/hive/dw';

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=root, access=WRITE, inode="/user/hive":hive:hive:drwxrwxr-t

Fix: open up the directory permissions as the hdfs user.

su - hdfs
hadoop fs -chmod 777 /user/hive

hive
use dw;
create table aa(name string);

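Partitions

In a relational database, partitions are created in advance, usually by ranges of some column such as date.

Hive partitions are created automatically when data is written; an insert into a partitioned table must specify the partition.

There are two ways to get a file into a Hive table:

Method 1: use the load command.

Method 2: first hadoop fs -put the file to HDFS, then alter the location.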
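
The two methods can be sketched side by side as follows. The functions only print the commands rather than running them. The function names and the base directory /user/hive/external/track_log are hypothetical, while the table track_log and the input file come from the example later in this article.

```shell
# Method 1: let Hive copy a local file into the partition.
load_cmd() {
    table="$1"; file="$2"; day="$3"; hour="$4"
    printf "hive -e \"LOAD DATA LOCAL INPATH '%s' OVERWRITE INTO TABLE %s PARTITION (date='%s',hour='%s');\"\n" \
        "$file" "$table" "$day" "$hour"
}

# Method 2: put the file on HDFS yourself, then point a partition at it.
put_cmds() {
    table="$1"; file="$2"; day="$3"; hour="$4"; base="$5"
    dir="$base/date=$day/hour=$hour"
    printf 'hadoop fs -mkdir -p %s\n' "$dir"
    printf 'hadoop fs -put %s %s/\n' "$file" "$dir"
    printf "hive -e \"alter table %s add partition (date='%s',hour='%s') location '%s';\"\n" \
        "$table" "$day" "$hour" "$dir"
}

load_cmd track_log /root/data/2015082818 2015-08-28 18
put_cmds track_log /root/data/2015082818 2015-08-28 18 /user/hive/external/track_log
```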

Hive has two kinds of insert: insert overwrite (replace) and insert into (append).

create table track_log (
id                         string ,
url                        string ,
referer                    string ,
keyword                    string ,
type                       string ,
guid                       string ,
pageId                     string ,
moduleId                   string ,
linkId                     string ,
attachedInfo               string ,
sessionId                  string ,
trackerU                   string ,
trackerType                string ,
ip                         string ,
trackerSrc                 string ,
cookie                     string ,
orderCode                  string ,
trackTime                  string ,
endUserId                  string ,
firstLink                  string ,
sessionViewNo              string ,
productId                  string ,
curMerchantId              string ,
provinceId                 string ,
cityId                     string )
PARTITIONED BY (date string,hour string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive -e "LOAD DATA LOCAL INPATH '/root/data/2015082818' OVERWRITE INTO TABLE track_log PARTITION (date='2015-08-28',hour='18');"

hive -e "LOAD DATA LOCAL INPATH '/root/data/2015082819' OVERWRITE INTO TABLE track_log PARTITION (date='2015-08-28',hour='19');"

select date,count(url) as pv, count(distinct guid) as uv from track_log where date='2015-08-28' group by date;

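A partition column name cannot duplicate a regular column name; in queries, partition columns behave just like regular columns.

Dynamic partitions

Table 1 is partitioned by date. How do we write its data into table 2, which is partitioned by date and hour?

With static partitions, every hour needs its own insert:

insert overwrite table table2 partition(date='', hour='00')
select
from table1
 where hour(time)=0;

With dynamic partitions, the hour partition is left without a value and is filled from the last column of the select, so a single insert writes every hour:

create table rpt_visit_daily_hour
(
    pv bigint,
    uv bigint
) partitioned by (date string, hour string);

insert overwrite table rpt_visit_daily_hour partition (date='2015-08-28', hour)
select count(url) as pv,
count(distinct guid) as uv,
hour
from track_log
where date='2015-08-28' group by date,hour;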
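
To make the routing concrete, here is a small local simulation of what the insert above computes. It is plain awk, not Hive: rows are grouped by the trailing hour column and each group maps to one hour=... partition. The function name and the sample rows (guid, url, hour) are invented for illustration.

```shell
# Simulate the aggregation behind the dynamic-partition insert: one output
# line per hour value, labelled with the partition path it would land in.
# Input columns (tab-separated): guid, url, hour.
pv_uv_by_hour() {
    awk -F '\t' '
        {
            pv[$3]++                          # count(url)
            if (!seen[$3, $1]++) uv[$3]++     # count(distinct guid)
        }
        END {
            for (h in pv)
                printf "date=2015-08-28/hour=%s\tpv=%d\tuv=%d\n", h, pv[h], uv[h]
        }
    ' | sort
}

printf 'g1\t/a\t18\ng2\t/b\t18\ng1\t/c\t18\ng3\t/a\t19\n' | pv_uv_by_hour
```

With the sample rows, hour 18 gets pv=3 and uv=2 (g1 appears twice), and hour 19 gets pv=1 and uv=1.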

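Sources of Hive table data:

Business systems: sqoop imports and exports between relational databases and hive/hdfs.
Data files: the hive load command, used to load website user behavior data.
Other tables: insert … select
Message middleware: for example, kafka consumed offline and written to HDFS.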

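Q: Where is an external table's data after the table is dropped?

A: The data of an external table is not deleted; only the table's metadata is removed. You can manually map the HDFS directory back to a Hive table partition:
hive -e "alter table tt add partition (date='',hour='') location '/user/hive/warehouse/track_log/date=2015-08-28/hour=18'"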
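
When many directories need to be mapped back, the partition values can be read from the date=.../hour=... path itself. The sketch below derives one alter table statement per directory and only prints it. The function name is hypothetical, and the hour=19 path is assumed to exist alongside the hour=18 path from the answer above.

```shell
# Derive an "alter table ... add partition" statement from a
# date=.../hour=... directory path.
add_partition_stmt() {
    table="$1"
    path="$2"
    day="$(printf '%s\n' "$path" | sed -n 's|.*/date=\([^/]*\)/hour=\([^/]*\)/*$|\1|p')"
    hour="$(printf '%s\n' "$path" | sed -n 's|.*/date=\([^/]*\)/hour=\([^/]*\)/*$|\2|p')"
    printf "alter table %s add partition (date='%s',hour='%s') location '%s';\n" \
        "$table" "$day" "$hour" "$path"
}

for p in /user/hive/warehouse/track_log/date=2015-08-28/hour=18 \
         /user/hive/warehouse/track_log/date=2015-08-28/hour=19; do
    add_partition_stmt tt "$p"
done
```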

Official Hive documentation:

https://cwiki.apache.org/confluence/display/Hive/Tutorial
