释放双眼，带上耳机，听听看~！

CDH Hadoop系列目录：

Hadoop实战（3）_虚拟机搭建CDH的全分布模式

Hadoop实战（4）_Hadoop的集群管理和资源分配

Hadoop实战（5）_Hadoop的运维经验

Hadoop实战（8）_CDH添加Hive服务及Hive基础

Hadoop实战（9）_Hive进阶及UDF开发

Sqoop语法说明

Sqoop官方学习文档：http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.6-cdh5.9.0/

Sqoop import是相对于HDFS来讲，即从关系数据库import到HDFS上。

mysql的驱动包放到sqoop/lib下。

案例一：把数据导入到HDFS上


1
2
3
4
5
6
7
8
9
10
1/root/project

2mkdir sqoop_prj

3cd sqoop_prj/

4mkdir DBS

5cd DBS/

6touch DBS.opt

7

8hadoop fs -mkdir /user/hive/warehouse/DBS

9which sqoop

10

执行opt文件，不能传参，sqoop –options-file aa.opt。-m，指定map数，如果抽取的表数据量大，则调大map数。如果-m设置为5，5个线程，则在HDFS上产生5个文件。

把sqoop写到shell脚本的好处，可以传参数。


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
1

2#!/bin/sh

3. /etc/profile

4

5hadoop fs -rmr /user/hive/warehouse/DBS

6

7

8

9sqoop import  --connect &quot;jdbc:mysql://cdhmaster:3306/hive&quot;    \

10--username root                                                          \

11--password 123456                                                        \

12-m    1                                                             \

13--table  DBS                                                           \

14--columns   &quot;DB_ID,DESC,DB_LOCATION_URI,NAME,OWNER_NAME,OWNER_TYPE&quot;         \

15--target-dir  &quot;/user/hive/warehouse/DBS&quot;    

16

17#--where &quot;length(DESC)&gt;0&quot;                                               \                               

18#--null-string &#x27;&#x27;

19

bug，驱动问题


1
2
3
1ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3c1a42fa is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.

2java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3c1a42fa is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.

3

增加参数，参考

https://stackoverflow.com/questions/29162447/sqoop-import-issue-with-mysql

https://stackoverflow.com/questions/26375269/sqoop-error-manager-sqlmanager-error-reading-from-database-java-sql-sqlexcept


1
2
1--driver com.mysql.jdbc.Driver

2

增加参数后的警告，


1
2
1WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.

2

bug，sql语法问题，


1
2
1Error: java.io.IOException: SQLException in nextKeyValue

2

去掉关键词列DESC，参考，

https://community.cloudera.com/t5/Data-Ingestion-Integration/sqoop-throws-SQLException-in-nextKeyValue/m-p/42653

案例二：数据写Hive普通表(非分区表)


1
2
3
4
5
6
7
8
9
10
11
1# mysql

2create table test (id int, pdate date);

3insert into test(id, pdate) values (1, &#x27;2017-11-05&#x27;);

4insert into test(id, pdate) values (2, &#x27;2017-11-06&#x27;);

5insert into test(id, pdate) values (3, &#x27;2017-11-05&#x27;);

6insert into test(id, pdate) values (4, &#x27;2017-11-06&#x27;);

7

8# hive

9drop table if exists test;

10create table test(id int, pdate string);

11

–hive-import，指定要写入hive表，该参数无value。

–hive-overwrite。

–hive-table，test。

案例三：写Hive分区表，so，salesorder

注意事项：

1、用什么字段做分区？
创建时间，而不是last_modify_time。

Q: 用创建时间抽取至hive分区，订单状态变化周期是45天，订单状态变化后，hive数据如何同步？


1
2
3
4
5
6
7
8
9
10
1# cdhmaster

2cd ~

3mysql -uroot -p123456 &lt; so.sql

4ERROR 1046 (3D000) at line 3: No database selected

5

6vi so.sql

7use test;

8

9mysql -uroot -p123456 &lt; so.sql

10


1
2
3
4
5
6
7
8
1# hive

2CREATE TABLE so (

3  order_id bigint,

4  user_id bigint,

5  order_amt double ,

6  last_modify_time string

7) partitioned by (date string);

8

Sqoop执行后，注意：

会在该用户HDFS的home目录下，产生一个与源表同名的目录，如/user/root/so

如果sqoop import至hive成功，该目录会自动删掉。

在执行的目录下产生一个java文件，即opt转化的MR Job代码。
sqoop import中，无论hive表是什么列分隔符，均可以自动兼容。

Sqoop抽取框架封装：

建一个mysql配置表，配置需要抽取的表及信息；
Java读取mysql配置表，动态生成opt文件；
Java中执行Process类调本地系统命令—sqoop –options-file opt文件；

Sqoop-imp -task 1 “2015-04-21”

Sqoop-imp “2015-04-21”

Sqoop export


1
2
3
4
1# mysql test

2create table so1 as 

3select * from so where 1=0;

4

源头必须是HDFS/Hive，目标关系数据库。

表so1的date和last_modify_time修改为varchar。

Sqoop工具封装

Flow etl 执行所有已配置的表抽取。

Flow etl -task 1

Flow etl -task 1 2017-01-01

读取mysql的extract_to_hdfs和extract_db_info，根据配置信息生成.opt文件。
通过Java的Process类调Linux命令：sqoop –options-file opt文件。

idea打包Flow.jar，'D:/Java/idea/IdeaProjects/Hive_Prj/src/META-INF/MANIFEST.MF' already exists in VFS，删掉文件夹META-INF。

db.properties是访问mysql数据库的配置。

extract_db_info，抽取的表来自的数据库的配置。

Flow.jar上传至/root/project/lib。

/root/project/bin，创建Flow命令。

配置FLOW_HOME，


1
2
3
4
5
6
1vi /etc/profile

2

3export FLOW_HOME=/root/project

4

5source /etc/profile

6

配置db.properties，


1
2
3
4
5
6
7
8
9
10
1# FLOW_HOME

2mkdir conf

3

4vi db.properties

5

6db.driver=com.mysql.jdbc.Driver

7db.url=jdbc:mysql://cdhmaster:3306/test

8db.user=root

9db.password=123456

10

配置sqoop option目录sqoop/opts。


1
2
3
1# FLOW_HOME

2mkdir -p sqoop/opts

3

如果要在执行时产生日志，需要开发jar时配置log4j。


1
2
3
1ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@310d117d is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.

2java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@310d117d is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.

3

HDFSExtract.java，增加配置–driver com.mysql.jdbc.Driver，重新打包上传。

作业可以相应做修改，如sh ./so.sh


1
2
3
4
5
1# /root/project/sqoop_prj/DBS

2vi so.sh

3

4Flow etl -task 1 $yestoday

5

{{userData.name}}已认证

Hadoop实战（10）_Sqoop import与抽取框架封装

Sqoop语法说明

案例一：把数据导入到HDFS上

案例二：数据写Hive普通表(非分区表)

案例三：写Hive分区表，so，salesorder

Sqoop export

Sqoop工具封装

基于spring boot和mongodb打造一套完整的权限架构（一）

Ubuntu上NFS的安装配置

{{userData.name}}已认证

Sqoop语法说明

案例一：把数据导入到HDFS上

案例二：数据写Hive普通表(非分区表)

案例三：写Hive分区表，so，salesorder

Sqoop export

Sqoop工具封装

Related posts:

基于spring boot和mongodb打造一套完整的权限架构（一）

Ubuntu上NFS的安装配置

MySQL主从复制

Kubernetes使用集群联邦实现多集群管理

Docker与Kubernetes系列(二): Docker的基本用法

MySQL数据库优化技巧