CDH Hadoop系列目录:
Hadoop实战(3)_虚拟机搭建CDH的全分布模式
Hadoop实战(4)_Hadoop的集群管理和资源分配
Hadoop实战(5)_Hadoop的运维经验
Hadoop实战(8)_CDH添加Hive服务及Hive基础
Hadoop实战(9)_Hive进阶及UDF开发
Sqoop语法说明
Sqoop官方学习文档:http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.6-cdh5.9.0/
Sqoop import是相对于HDFS来讲,即从关系数据库import到HDFS上。
mysql的驱动包放到sqoop/lib下。
案例一:把数据导入到HDFS上
1
2
3
4
5
6
7
8
9
10 1/root/project
2mkdir sqoop_prj
3cd sqoop_prj/
4mkdir DBS
5cd DBS/
6touch DBS.opt
7
8hadoop fs -mkdir /user/hive/warehouse/DBS
9which sqoop
10
执行opt文件,不能传参,sqoop –options-file aa.opt。-m,指定map数,如果抽取的表数据量大,则调大map数。如果-m设置为5,5个线程,则在HDFS上产生5个文件。
把sqoop写到shell脚本的好处,可以传参数。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19 1
2#!/bin/sh
3. /etc/profile
4
5hadoop fs -rmr /user/hive/warehouse/DBS
6
7
8
9sqoop import --connect "jdbc:mysql://cdhmaster:3306/hive" \
10--username root \
11--password 123456 \
12-m 1 \
13--table DBS \
14--columns "DB_ID,DESC,DB_LOCATION_URI,NAME,OWNER_NAME,OWNER_TYPE" \
15--target-dir "/user/hive/warehouse/DBS"
16
17#--where "length(DESC)>0" \
18#--null-string ''
19
bug,驱动问题
1
2
3 1ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3c1a42fa is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
2java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3c1a42fa is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
3
增加参数,参考
https://stackoverflow.com/questions/29162447/sqoop-import-issue-with-mysql
1
2 1--driver com.mysql.jdbc.Driver
2
增加参数后的警告,
1
2 1WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.
2
bug,sql语法问题,
1
2 1Error: java.io.IOException: SQLException in nextKeyValue
2
去掉关键词列DESC,参考,
案例二:数据写Hive普通表(非分区表)
1
2
3
4
5
6
7
8
9
10
11 1# mysql
2create table test (id int, pdate date);
3insert into test(id, pdate) values (1, '2017-11-05');
4insert into test(id, pdate) values (2, '2017-11-06');
5insert into test(id, pdate) values (3, '2017-11-05');
6insert into test(id, pdate) values (4, '2017-11-06');
7
8# hive
9drop table if exists test;
10create table test(id int, pdate string);
11
–hive-import,指定要写入hive表,该参数无value。
–hive-overwrite。
–hive-table,test。
案例三:写Hive分区表,so,salesorder
注意事项:
1、用什么字段做分区?
创建时间,而不是last_modify_time。
Q: 用创建时间抽取至hive分区,订单状态变化周期是45天,订单状态变化后,hive数据如何同步?
1
2
3
4
5
6
7
8
9
10 1# cdhmaster
2cd ~
3mysql -uroot -p123456 < so.sql
4ERROR 1046 (3D000) at line 3: No database selected
5
6vi so.sql
7use test;
8
9mysql -uroot -p123456 < so.sql
10
1
2
3
4
5
6
7
8 1# hive
2CREATE TABLE so (
3 order_id bigint,
4 user_id bigint,
5 order_amt double ,
6 last_modify_time string
7) partitioned by (date string);
8
Sqoop执行后,注意:
- 会在该用户HDFS的home目录下,产生一个与源表同名的目录,如/user/root/so
如果sqoop import至hive成功,该目录会自动删掉。
- 在执行的目录下产生一个java文件,即opt转化的MR Job代码。
- sqoop import中,无论hive表是什么列分隔符,均可以自动兼容。
Sqoop抽取框架封装:
- 建一个mysql配置表,配置需要抽取的表及信息;
- Java读取mysql配置表,动态生成opt文件;
- Java中执行Process类调本地系统命令—sqoop –options-file opt文件;
Sqoop-imp -task 1 “2015-04-21”
Sqoop-imp “2015-04-21”
Sqoop export
1
2
3
4 1# mysql test
2create table so1 as
3select * from so where 1=0;
4
源头必须是HDFS/Hive,目标关系数据库。
表so1的date和last_modify_time修改为varchar。
Sqoop工具封装
Flow etl 执行所有已配置的表抽取。
Flow etl -task 1
Flow etl -task 1 2017-01-01
- 读取mysql的extract_to_hdfs和extract_db_info,根据配置信息生成.opt文件。
- 通过Java的Process类调Linux命令:sqoop –options-file opt文件。
idea打包Flow.jar,'D:/Java/idea/IdeaProjects/Hive_Prj/src/META-INF/MANIFEST.MF' already exists in VFS,删掉文件夹META-INF。
db.properties是访问mysql数据库的配置。
extract_db_info,抽取的表来自的数据库的配置。
Flow.jar上传至/root/project/lib。
/root/project/bin,创建Flow命令。
配置FLOW_HOME,
1
2
3
4
5
6 1vi /etc/profile
2
3export FLOW_HOME=/root/project
4
5source /etc/profile
6
配置db.properties,
1
2
3
4
5
6
7
8
9
10 1# FLOW_HOME
2mkdir conf
3
4vi db.properties
5
6db.driver=com.mysql.jdbc.Driver
7db.url=jdbc:mysql://cdhmaster:3306/test
8db.user=root
9db.password=123456
10
配置sqoop option目录sqoop/opts。
1
2
3 1# FLOW_HOME
2mkdir -p sqoop/opts
3
如果要在执行时产生日志,需要开发jar时配置log4j。
1
2
3 1ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@310d117d is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
2java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@310d117d is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
3
HDFSExtract.java,增加配置–driver com.mysql.jdbc.Driver,重新打包上传。
作业可以相应做修改,如sh ./so.sh
1
2
3
4
5 1# /root/project/sqoop_prj/DBS
2vi so.sh
3
4Flow etl -task 1 $yestoday
5