Test data: http://www.nber.org/patents/apat63_99.zip
Test environment: hadoop-2.3 + hive-0.13.1 + hbase-0.98.4
Test performance: a 6-column, 600-million-row Hive table loaded in about half an hour
-
Create hfile.hql:
drop table hbase_splits;
CREATE EXTERNAL TABLE IF NOT EXISTS hbase_splits(partition STRING)
ROW FORMAT
  SERDE 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
LOCATION '/tmp/hbase_splits_out';

-- create a location to store the resulting HFiles
drop table hbase_hfiles;
CREATE TABLE IF NOT EXISTS hbase_hfiles(rowkey STRING, pageviews STRING, bytes STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
TBLPROPERTIES('hfile.family.path' = '/tmp/hbase_hfiles/w');

ADD JAR /root/hive-0.13.1/lib/hive-contrib-0.13.1.jar;
SET mapred.reduce.tasks=1;

CREATE TEMPORARY FUNCTION row_seq AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

-- input contains ~4mm records; sample it to produce 4 split keys (5 ranges)
INSERT OVERWRITE TABLE hbase_splits
SELECT PATENT FROM
  (SELECT PATENT, row_seq() AS seq FROM apat TABLESAMPLE(BUCKET 1 OUT OF 1000 ON PATENT) s
   ORDER BY PATENT
   LIMIT 10000000) x
WHERE (seq % 300) = 0
ORDER BY PATENT
LIMIT 4;

-- after this finishes, move the splits file into place:
dfs -rmr /tmp/hbase_splits;
dfs -cp /tmp/hbase_splits_out/* /tmp/hbase_splits;

ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-client-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-common-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-examples-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-hadoop2-compat-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-hadoop-compat-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-it-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-prefix-tree-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-protocol-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-server-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-shell-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-testing-util-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-thrift-0.98.4-hadoop2.jar;
ADD JAR /root/hive-0.13.1/lib/hive-hbase-handler-0.13.1.jar;

SET mapred.reduce.tasks=5;
-- SET total.order.partitioner.path=/tmp/hbase_splits;  -- Hadoop 1 property name
SET mapreduce.totalorderpartitioner.path=/tmp/hbase_splits;
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

-- generate HFiles using the split ranges
INSERT OVERWRITE TABLE hbase_hfiles
SELECT PATENT, GYEAR, GDATE FROM apat
CLUSTER BY PATENT;
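The sampling step above boils down to: take a sorted sample of rowkeys, keep every 300th one, and cap the result at mapred.reduce.tasks minus 1 keys. A minimal Python sketch of that selection (the function name, sample, and stride here are illustrative only, not part of the job):

```python
# Sketch of the split-key selection done by the sampling query in hfile.hql:
# from a sorted sample of rowkeys, keep every `stride`-th key, capped at
# reducers-1 keys, so the keys carve the keyspace into `reducers` ranges.
def pick_split_keys(sorted_sample, stride, reducers):
    keys = [k for i, k in enumerate(sorted_sample, start=1) if i % stride == 0]
    return keys[:reducers - 1]

# ~10k sampled 7-digit patent ids, already sorted
sample = ["%07d" % n for n in range(0, 4_000_000, 400)]
splits = pick_split_keys(sample, stride=300, reducers=5)
print(len(splits))  # 4 split keys -> 5 key ranges, one per reducer
```

The cap matters: the splits file must contain exactly reducers-1 keys, which is why the query ends with LIMIT 4 when 5 reduce tasks are used later.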
-
Create the HBase table:
# echo "create 'apat','w'" | hbase shell
-
Load the data:
# hive -f hfile.hql

# hadoop jar /root/hbase-0.98.4-hadoop2/lib/hbase-server-0.98.4-hadoop2.jar completebulkload /tmp/hbase_hfiles apat

# echo "scan 'apat', { LIMIT => 2 }" | hbase shell
-
Possible problems:
-
Can't read partitions file: replace SET total.order.partitioner.path=/tmp/hbase_splits; with set mapreduce.totalorderpartitioner.path=/tmp/hbase_splits; (the former is the Hadoop 1 property name). The error looks like this:
Diagnostic Messages for this Task:
Error: java.lang.IllegalArgumentException: Can't read partitions file
        at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:116)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.<init>(MapTask.java:569)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.FileNotFoundException: File file:/home/hadoop/hadoopData/tmp/nm-local-dir/usercache/root/appcache/application_1412175430384_0093/container_1412175430384_0093_01_000008/_partition.lst does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:722)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1749)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
        at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.readPartitions(TotalOrderPartitioner.java:301)
        at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:88)
        ... 10 more
-
java.io.IOException: Added a key not lexically larger than previous key: the rowkey is probably not generated correctly, or there are duplicate rowkeys.
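The invariant behind this exception is that each HFile must receive its cells in strictly increasing byte order, so a duplicate or out-of-order rowkey aborts the write. A hypothetical Python check illustrating the rule (not actual HBase code):

```python
# Mimic the ordering check an HFile writer performs: every key must be
# strictly larger (as bytes) than the previous one; duplicates also fail.
def check_key_order(rowkeys):
    prev = None
    for key in rowkeys:
        kb = key.encode("utf-8")
        if prev is not None and kb <= prev:
            raise IOError(
                "Added a key not lexically larger than previous key: %r" % key)
        prev = kb

check_key_order(["1000001", "1000002", "1000010"])  # ok: strictly increasing

# Byte-order pitfall with numeric keys stored as strings:
# "9" > "10" lexically, so unpadded numbers can trigger this error too.
```

CLUSTER BY PATENT only sorts within each reducer; duplicate keys or string-vs-numeric ordering mismatches still surface here.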
-
Split points are out of order: the splits file is not sorted.
-
No files found… in the reduce phase: one of the key ranges defined by hbase_splits contains no data.
-
Wrong number of partitions in keyset: the number of rows in hbase_splits does not match the later SET mapred.reduce.tasks=5; hbase_splits must contain exactly mapred.reduce.tasks minus 1 rows.
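This count relationship is exactly what TotalOrderPartitioner relies on: n-1 sorted split points define n key ranges, one per reducer. A rough Python sketch, with bisect standing in for the partitioner's binary search (illustrative, not Hadoop code):

```python
import bisect

# n-1 sorted split points -> n partitions; a mismatch with the reducer
# count is what surfaces as "Wrong number of partitions in keyset".
def get_partition(rowkey, split_points, num_reducers):
    if len(split_points) != num_reducers - 1:
        raise ValueError("Wrong number of partitions in keyset")
    return bisect.bisect_right(split_points, rowkey)

splits = ["2000000", "4000000", "6000000", "8000000"]  # 4 points, 5 reducers
print(get_partition("0123456", splits, 5))  # partition 0
print(get_partition("9999999", splits, 5))  # partition 4
```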
-
Class not found: org.apache.hadoop.hbase.filter.Filter during the bulk load: the HBase jars need to be copied (or symlinked) into /root/hadoop-2.3.0/share/hadoop/common/lib, like so:
cd /root/hadoop-2.3.0/share/hadoop/common/lib
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-client-0.98.4-hadoop2.jar hbase-client-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-common-0.98.4-hadoop2.jar hbase-common-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-examples-0.98.4-hadoop2.jar hbase-examples-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-hadoop2-compat-0.98.4-hadoop2.jar hbase-hadoop2-compat-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-hadoop-compat-0.98.4-hadoop2.jar hbase-hadoop-compat-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-it-0.98.4-hadoop2.jar hbase-it-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-prefix-tree-0.98.4-hadoop2.jar hbase-prefix-tree-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-protocol-0.98.4-hadoop2.jar hbase-protocol-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-server-0.98.4-hadoop2.jar hbase-server-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-shell-0.98.4-hadoop2.jar hbase-shell-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-thrift-0.98.4-hadoop2.jar hbase-thrift-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/htrace-core-2.04.jar htrace-core-2.04.jar
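If symlinking into Hadoop's lib directory is undesirable, an alternative worth trying (assuming the hbase launcher script exists on this box) is to extend the classpath only for the bulk-load command:

```shell
# Alternative to symlinking: put HBase's jars on the classpath just for
# this one invocation via HADOOP_CLASSPATH ("hbase classpath" prints it).
export HADOOP_CLASSPATH=$(/root/hbase-0.98.4-hadoop2/bin/hbase classpath)
hadoop jar /root/hbase-0.98.4-hadoop2/lib/hbase-server-0.98.4-hadoop2.jar \
  completebulkload /tmp/hbase_hfiles apat
```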
Reference:
https://cwiki.apache.org/confluence/display/Hive/HBaseBulkLoad