Integrating Spark with HBase


Preface

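Until now I treated HBase as nothing more than a horizontally scalable KV database with persistence, so I only used it for metric storage; see my much earlier article "Metric Storage for Storm Real-Time Computation Based on HBase". This time I am using HBase to store user behavior, because rowkey filtering works well: it is easy to pull out every behavior record along the user dimension or the content dimension. In a sense, HBase is a storage engine with exactly one multi-field composite index.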
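To make the "single composite index" idea concrete, here is a minimal sketch in plain Scala, with no HBase involved. The rowkey layout (userId|contentId|timestamp), the object name, and the sample data are all assumptions made for illustration, not the layout used in the system described here. A sorted map stands in for HBase's rowkey-ordered storage, and a prefix filter becomes a seek followed by a forward read.

```scala
import scala.collection.immutable.TreeMap

object RowkeyPrefixDemo {
  // Hypothetical layout (an assumption): userId|contentId|zero-padded timestamp.
  def rowkey(userId: String, contentId: String, ts: Long): String =
    f"$userId|$contentId|$ts%013d"

  // A TreeMap keeps keys sorted, loosely mimicking how HBase orders rows by rowkey.
  // A prefix filter is a seek to the prefix plus a forward read while it still matches.
  def scanPrefix(table: TreeMap[String, String], prefix: String): List[(String, String)] =
    table.iteratorFrom(prefix).takeWhile { case (k, _) => k.startsWith(prefix) }.toList

  def main(args: Array[String]): Unit = {
    val table = TreeMap(
      rowkey("u1", "c9", 1000L) -> "view",
      rowkey("u1", "c3", 2000L) -> "like",
      rowkey("u2", "c3", 1500L) -> "view"
    )
    // All behavior of user u1, regardless of content or time.
    scanPrefix(table, "u1|").foreach(println)
    // Behavior of user u1 on content c3 only.
    scanPrefix(table, "u1|c3|").foreach(println)
  }
}
```

Only leading fields of the key can be filtered this cheaply, which is why the order of fields in the rowkey matters as much as the order of columns in a composite index.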

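Although I favor real-time computation, batch processing is still unavoidable for backfilling data or computing over historical data. For historical data I had two options: compute over the behavior data already stored in HBase, or compute over the raw data in Hive. I chose the former, which brings in batch operations on HBase from Spark (StreamingPro).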

Integration Process

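Integrating with Spark means you had better have a schema (mapping), because both the DataFrame and SQL APIs require one. Unfortunately, whether an HBase table has a schema depends on the user and the scenario. Spark-on-HBase libraries usually require you to define a mapping (schema); for example, Hortonworks' SHC (https://github.com/hortonworks-spark/shc) requires a configuration like the following: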


{
  "rowkey":"key",
  "table":{"namespace":"default", "name":"pi_user_log", "tableCoder":"PrimitiveType"},
  "columns":{"col0":{"cf":"rowkey", "col":"key", "type":"string"},
    "col1":{"cf":"f","col":"col1", "type":"string"}
  }
}

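The definition above is easy to read: each HBase column family and column pair is given a name so that it can be used through Spark's DataSource API. For how to develop against the Spark DataSource API, see my article "Implementing a REST Data Source with the Spark DataSource API"; SHC essentially implements this API. Now you can use it like this: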


import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// SHC catalog: exposes the rowkey and column f:28360592 as DataFrame columns.
val cat = """{
  |"rowkey":"key",
  |"table":{"namespace":"default", "name":"pi_user_log", "tableCoder":"PrimitiveType"},
  |"columns":{"col0":{"cf":"rowkey", "col":"key", "type":"string"},
  |"28360592":{"cf":"f","col":"28360592", "type":"string"}
  |}
  |}""".stripMargin
// Load the HBase table as a DataFrame through the SHC data source.
val cc = sqlContext
  .read
  .options(Map(HBaseTableCatalog.tableCatalog -> cat))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
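When the column names are at least known to the program, the catalog string does not have to be written by hand. The following is a minimal sketch that generates a catalog in the same shape as the one above for string-typed columns in a single column family. `CatalogBuilder` and its parameters are hypothetical helpers, not part of SHC, and the sketch assumes the names contain no characters that need JSON escaping.

```scala
object CatalogBuilder {
  // Builds an SHC-style catalog JSON string. Every column is typed as "string"
  // and named after its qualifier; both choices are simplifying assumptions.
  def build(namespace: String, table: String, family: String, qualifiers: Seq[String]): String = {
    val rowkeyCol = """"col0":{"cf":"rowkey", "col":"key", "type":"string"}"""
    val dataCols = qualifiers.map { q =>
      s""""$q":{"cf":"$family", "col":"$q", "type":"string"}"""
    }
    val columns = (rowkeyCol +: dataCols).mkString(",\n")
    s"""{
       |"rowkey":"key",
       |"table":{"namespace":"$namespace", "name":"$table", "tableCoder":"PrimitiveType"},
       |"columns":{$columns}
       |}""".stripMargin
  }

  def main(args: Array[String]): Unit =
    println(build("default", "pi_user_log", "f", Seq("28360592", "28360593")))
}
```

This only moves the problem: something still has to supply the list of qualifiers.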

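However, once you have thousands of columns this approach breaks down: you are unlikely to define them one by one, users often do not know which columns will exist, and a column name may even be a timestamp. We have run into several of these situations, so they all need to be solved. There are two approaches: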

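Automatically collect every column in HBase to form the schema, so the user does not have to configure anything.
Stipulate that the HBase table exposes only two columns, rowkey and content, where content is a map whose keys are column family plus column name and whose values are the corresponding cell contents.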

Let's start with the second approach (because the first approach actually depends on it):


{
  "name": "batch.sources",
  "params": [
    {
      "inputTableName": "log1",
      "format": "org.apache.spark.sql.execution.datasources.hbase.raw",
      "path": "-",
      "outputTable": "log1"
    }
  ]
},
{
  "name": "batch.sql",
  "params": [
    {
      "sql": "select rowkey,json_value_collect(content) as actionList from log1",
      "outputTableName":"finalTable"
    }
  ]
},

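First we configure an HBase table named log1. The program obtains the HBase connection through hbase-site.xml, which is why no HBase-related settings appear in the configuration. Next, in SQL you can process content; here I convert content into a JSON-formatted string. After that you can write your own UDF or similar to implement complex business logic. Every field we store is itself JSON, so I do not care about the column names; I only need to get all the columns, and the example above meets exactly that need.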

Implementing this HBase DataSource is also simple. The core logic is roughly as follows:


import scala.collection.JavaConverters._

import net.liftweb.{json => SJSon}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

case class HBaseRelation(
    parameters: Map[String, String],
    userSpecifiedschema: Option[StructType]
)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  val hbaseConf = HBaseConfiguration.create()

  // Two string columns: the row key and a JSON object holding every cell of the row.
  override def schema: StructType = userSpecifiedschema.getOrElse(
    StructType(Seq(StructField("rowkey", StringType), StructField("content", StringType))))

  def buildScan(): RDD[Row] = {
    hbaseConf.set(TableInputFormat.INPUT_TABLE, parameters("inputTableName"))
    sqlContext.sparkContext
      .newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
      .map { case (_, result) =>
        implicit val formats = SJSon.Serialization.formats(SJSon.NoTypeHints)
        val rowKey = Bytes.toString(result.getRow)

        // Flatten the row into "family:qualifier" -> value pairs.
        val content = result.getMap.navigableKeySet().asScala.flatMap { f =>
          result.getFamilyMap(f).asScala.map { case (qualifier, value) =>
            (Bytes.toString(f) + ":" + Bytes.toString(qualifier), Bytes.toString(value))
          }
        }.toMap

        // Serialize the map to JSON so Spark SQL sees a plain string column.
        Row(rowKey, SJSon.Serialization.write(content))
      }
  }
}

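Coming back to the question: how can Spark discover the schema automatically? Broadly, you still have to scan all the data to get the union of columns and then build the schema from it, which is expensive. Alternatively, we can first convert the data to JSON and then rely on Spark's built-in ability to infer a schema from JSON.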
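The "union of columns" step can be sketched in plain Scala collections. This is only an illustration of the idea under simplifying assumptions: each row is already a map from "family:qualifier" to value, everything fits in memory, and `SchemaUnion` is a hypothetical name. On a real table the same fold would have to run as a distributed aggregation over every row, which is where the cost comes from.

```scala
object SchemaUnion {
  // One pass over all rows to collect every column name that appears anywhere.
  def columns(rows: Iterable[Map[String, String]]): List[String] =
    rows.foldLeft(Set.empty[String])((acc, row) => acc ++ row.keySet).toList.sorted

  // Projects a row onto the discovered columns; cells the row lacks become None.
  def project(cols: List[String], row: Map[String, String]): List[Option[String]] =
    cols.map(row.get)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Map("f:a" -> "1"),
      Map("f:b" -> "2", "f:a" -> "3")
    )
    val cols = columns(rows)
    println(cols)
    rows.foreach(r => println(project(cols, r)))
  }
}
```

Sparse rows end up mostly as None, so a wide inferred schema can be wasteful when each row only populates a few of the columns.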

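Overall, I do not really encourage using Spark for batch processing against HBase, because it can easily overload HBase, for example by causing an out-of-memory error that brings down a RegionServer. Worst of all, once a RegionServer goes down, reads and writes are unavailable for a while, and since HBase often serves as the storage for real-time online applications, the impact is significant.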
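If a batch scan is unavoidable, one common way to reduce its footprint is to bound the row range and keep the scan from filling the block cache. The sketch below only assembles the settings as a plain map. The property names are assumed to match the values of the constants on HBase's TableInputFormat (INPUT_TABLE, SCAN_ROW_START, SCAN_ROW_STOP, SCAN_CACHEDROWS, SCAN_CACHEBLOCKS) and should be checked against the HBase version in use; `BoundedScan` and the sample values are hypothetical.

```scala
object BoundedScan {
  // Returns Hadoop configuration entries for a range-limited table scan.
  // Property names are assumptions mirroring TableInputFormat's constants.
  def settings(table: String, startRow: String, stopRow: String, cachedRows: Int): Map[String, String] =
    Map(
      "hbase.mapreduce.inputtable" -> table,
      "hbase.mapreduce.scan.row.start" -> startRow,
      "hbase.mapreduce.scan.row.stop" -> stopRow,
      // Rows fetched per RPC; smaller values lower memory pressure per request.
      "hbase.mapreduce.scan.cachedrows" -> cachedRows.toString,
      // Avoid evicting data that online readers rely on from the block cache.
      "hbase.mapreduce.scan.cacheblocks" -> "false"
    )

  def main(args: Array[String]): Unit =
    settings("pi_user_log", "u1|", "u2|", 500).foreach { case (k, v) => println(s"$k=$v") }
}
```

Each pair could be applied with hbaseConf.set(key, value) before the newAPIHadoopRDD call. This narrows the load but does not remove the underlying risk to the RegionServers.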

Link: https://www.jianshu.com/p/b2fea6687735
