Getting Familiar with Common HBase Operations and Writing a MapReduce Job


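The following table and data come from a relational database. Convert the table into a form suitable for HBase storage and insert the data:

Student table (Student), not including the last column

Student No. (S_No)	Name (S_Name)	Sex (S_Sex)	Age (S_Age)	Course (course)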
2015001	Zhangsan	male	23
2015002	Mary	female	22
2015003	Lisi	male	24	Math 85
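One way to picture the conversion: each relational row becomes one HBase row keyed by the student number, and every other column becomes a cell addressed as family:qualifier. The sketch below generates the matching HBase shell commands for two of the rows. The column family name `info` and the helpers `create_command` and `to_hbase_puts` are assumptions made for illustration, since the post does not show the create statement that was actually used.

```python
# Sketch: turn relational rows into HBase shell commands.
# Assumption: non-key columns are stored in a column family called "info".

SAMPLE_ROWS = [
    {"S_No": "2015001", "S_Name": "Zhangsan", "S_Sex": "male", "S_Age": "23"},
    {"S_No": "2015003", "S_Name": "Lisi", "S_Sex": "male", "S_Age": "24"},
]


def create_command(table, family):
    """Return the shell command that creates a table with one column family."""
    return "create '%s','%s'" % (table, family)


def to_hbase_puts(table, family, key_column, rows):
    """Return one 'put' command per non-key cell of every row."""
    commands = []
    for row in rows:
        row_key = row[key_column]
        for column, value in row.items():
            if column == key_column:
                continue  # the key becomes the row key, not a cell
            commands.append(
                "put '%s','%s','%s:%s','%s'"
                % (table, row_key, family, column, value)
            )
    return commands


print(create_command("student", "info"))
for command in to_hbase_puts("student", "info", "S_No", SAMPLE_ROWS):
    print(command)
```

The printed lines can be pasted into the HBase shell one at a time; the row key must be unique per student because a second put with the same key and column overwrites the first.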

Procedure: 1. Start HDFS and HBase


Verify that both started successfully


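Create the table. The HBase shell kept hanging while I typed commands and the process was then killed, so the commands below are written out by hand.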


Complete the same tasks with the HBase Shell commands provided with Hadoop:

List information about all tables in HBase:

hbase(main):008:0> list

Print all records of the student table to the terminal:

hbase(main):009:0> scan 'student'

Add a course column family to the student table:

hbase(main):010:0> alter 'student',{NAME=>'course',VERSIONS=>3}

Add a Math column to the course column family and record a score of 85:

hbase(main):011:0> put 'student','2015003','course:Math','85'

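Delete the course column family (disable the table first, and re-enable it afterwards with enable 'student'):

disable 'student'
alter 'student',{NAME=>'course',METHOD=>'delete'}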

Count the rows in a table (count 's1' for a table named s1); for the student table:

count 'student'

Remove all records from a given table (truncate 's1' for a table named s1); for the student table:

truncate 'student'
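The effect of the shell commands above can be reasoned about with a toy model in plain Python: a table is a mapping from row key to cells, and each cell is addressed as family:qualifier. This is only a sketch for following the session on paper, not an HBase client; the class `ToyTable`, its method names, and the `info` column family are all hypothetical.

```python
# Toy in-memory model of an HBase table (illustration only, not a client).

class ToyTable:
    def __init__(self, name, families):
        self.name = name
        self.families = set(families)
        self.rows = {}  # row key -> {"family:qualifier": value}

    def alter_add_family(self, family):
        """Model of: alter 'student',{NAME=>'course',...}"""
        self.families.add(family)

    def put(self, row_key, column, value):
        """Model of: put 'student','<row>','<family:qualifier>','<value>'"""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError("unknown column family: %s" % family)
        self.rows.setdefault(row_key, {})[column] = value

    def alter_delete_family(self, family):
        """Model of dropping a column family together with all its cells."""
        self.families.discard(family)
        for cells in self.rows.values():
            for column in [c for c in cells if c.split(":", 1)[0] == family]:
                del cells[column]
        # a row with no cells left no longer exists
        self.rows = {key: cells for key, cells in self.rows.items() if cells}

    def count(self):
        """Model of: count 'student'"""
        return len(self.rows)

    def truncate(self):
        """Model of: truncate 'student' (schema stays, data goes)."""
        self.rows.clear()


student = ToyTable("student", ["info"])  # "info" is an assumed family name
student.put("2015001", "info:S_Name", "Zhangsan")
student.put("2015003", "info:S_Name", "Lisi")
student.alter_add_family("course")
student.put("2015003", "course:Math", "85")
print(student.count())           # two distinct row keys
student.alter_delete_family("course")
print(student.rows["2015003"])   # only the info cell remains
student.truncate()
print(student.count())           # table is empty again
```

Note that the row count stays the same after adding course:Math, because the new cell belongs to an existing row key.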

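Task: write a WordCount program in Python

Program	WordCount
Input	A text file containing a large number of words
Output	Each word in the file with its number of occurrences (frequency), sorted alphabetically by word, one word and its frequency per line, separated by whitespace

Write the map function and the reduce function
Change their permissions accordingly
Test the code on the local machine
Run it on HDFS
Download the input files and upload them to HDFS
Submit the job with the Hadoop Streaming command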

Procedure:

1. Create the mapper.py file

cd /home/hadoop/wc
sudo gedit mapper.py

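2. The map function

#!/usr/bin/env python
# Mapper: emit "<word>\t1" for every word read from standard input.
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # print() with a single argument behaves the same in Python 2 and 3
        print('%s\t%s' % (word, 1))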

3. The reduce function

#!/usr/bin/env python
# Reducer: sum the counts of consecutive identical words.
# The input must already be sorted by word.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # blank line, missing tab, or non-numeric count: skip it
        continue

    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            # a new word has started, so write out the finished one
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# write out the last word, if there was any valid input
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
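An alternative way to write the reduce step is with `itertools.groupby`, which groups consecutive items that share a key, with `operator.itemgetter` selecting the word as that key. The following is a minimal sketch, assuming the input lines are already sorted by word (streaming sorts the mapper output by key before the reducer sees it); the function names `parse` and `reduce_counts` are illustrative.

```python
# Sketch: a groupby-based reducer over "<word>\t<count>" lines.
from itertools import groupby
from operator import itemgetter


def parse(lines, separator="\t"):
    """Yield (word, count) pairs, skipping malformed lines."""
    for line in lines:
        line = line.rstrip("\n")
        if separator not in line:
            continue  # blank line or missing separator
        word, count = line.split(separator, 1)
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not a number


def reduce_counts(lines):
    """Yield (word, total) for each run of identical consecutive words."""
    for word, group in groupby(parse(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


sample = ["bar\t1\n", "foo\t1\n", "foo\t1\n", "foo\t1\n",
          "labs\t1\n", "quux\t1\n", "quux\t1\n"]
for word, total in reduce_counts(sample):
    print("%s\t%d" % (word, total))
```

To use it as a streaming reducer, the loop at the bottom would iterate over `reduce_counts(sys.stdin)` instead of the sample list.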

4. Create the reducer.py file

cd /home/hadoop/wc
sudo gedit reducer.py

5. Set permissions and test the code

chmod a+x /home/hadoop/wc/mapper.py /home/hadoop/wc/reducer.py

echo "foo foo quux labs foo bar quux" | /home/hadoop/wc/mapper.py

echo "foo foo quux labs foo bar quux" | /home/hadoop/wc/mapper.py | sort -k1,1 | /home/hadoop/wc/reducer.py
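The mapper, sort, reducer pipeline used in this local test can also be mirrored in a single Python script, which is a convenient way to check the expected result before touching the cluster. This is a sketch with illustrative function names (`map_words`, `shuffle_sort`, `reduce_sorted`, `word_count`); it imitates what the two scripts and `sort -k1,1` do together.

```python
# Sketch: map -> sort -> reduce for WordCount, all in one process.

def map_words(lines):
    """Map step: emit (word, 1) for every word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def shuffle_sort(pairs):
    """Stand-in for 'sort -k1,1': order the pairs by word."""
    return sorted(pairs, key=lambda pair: pair[0])


def reduce_sorted(pairs):
    """Reduce step: sum the counts of consecutive identical words."""
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, count
    if current_word is not None:
        yield current_word, current_count


def word_count(lines):
    return list(reduce_sorted(shuffle_sort(map_words(lines))))


for word, count in word_count(["foo foo quux labs foo bar quux"]):
    print("%s\t%d" % (word, count))
# prints:
# bar	1
# foo	3
# labs	1
# quux	2
```

The reducer only works because equal words arrive next to each other, which is why the sort step sits between map and reduce.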

6. Download the files and upload them to HDFS

# Download the sample texts
cd /home/hadoop/wc
wget http://www.gutenberg.org/files/5000/5000-8.txt
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt

# Upload them to HDFS (create the target directory first if it does not exist)
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put /home/hadoop/wc/*.txt /user/hadoop/input

Submit the job with the Hadoop Streaming command

Next, edit the .bashrc file and add the path of the streaming jar to the environment:

export HADOOP_HOME=/usr/local/hadoop
export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
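Because the STREAM value contains a wildcard, it is worth checking that the pattern matches exactly one jar before submitting a job. The sketch below does that check in Python, assuming the usual layout under share/hadoop/tools/lib; `find_streaming_jar` is a hypothetical helper, and the demo runs against a throwaway directory with a fake jar (version 0.0.0 is made up) so that it works without a Hadoop installation.

```python
# Sketch: resolve the hadoop-streaming-*.jar wildcard and check it is unique.
import glob
import os
import tempfile


def find_streaming_jar(hadoop_home):
    """Return the single streaming jar under hadoop_home, or raise."""
    pattern = os.path.join(
        hadoop_home, "share", "hadoop", "tools", "lib", "hadoop-streaming-*.jar"
    )
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        raise RuntimeError(
            "expected exactly one streaming jar, found %d: %s"
            % (len(matches), matches)
        )
    return matches[0]


# Demo on a temporary directory that mimics the layout (fake jar file).
with tempfile.TemporaryDirectory() as fake_home:
    lib_dir = os.path.join(fake_home, "share", "hadoop", "tools", "lib")
    os.makedirs(lib_dir)
    open(os.path.join(lib_dir, "hadoop-streaming-0.0.0.jar"), "w").close()
    print(os.path.basename(find_streaming_jar(fake_home)))
```

On a real machine the call would be `find_streaming_jar(os.environ["HADOOP_HOME"])`.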

After that, create run.sh in the working directory with the following content:

hadoop jar $STREAM \
-D stream.non.zero.exit.is.failure=false \
-file /home/hadoop/wc/mapper.py \
-mapper 'python /home/hadoop/wc/mapper.py' \
-file /home/hadoop/wc/reducer.py \
-reducer 'python /home/hadoop/wc/reducer.py' \
-input /user/hadoop/input/*.txt \
-output /user/hadoop/wcoutput

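The mapper and reducer commands include an explicit python prefix; without it the job failed.

The option -D stream.non.zero.exit.is.failure=false was added to the command above because the job threw this exception at run time:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed wi

By default, streaming treats a mapper or reducer that exits with a non-zero status as a failed task and runs it again. If every attempt (4 by default) returns a non-zero status, the whole job fails.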
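The status that streaming checks is the ordinary exit code of the child process, so a script can be checked for this problem locally. The sketch below runs two small Python snippets as child processes: one reads standard input cleanly, the other forgets the `sys.` prefix and dies with a NameError. The helper `exit_status` and the two snippets are illustrative and not part of Hadoop.

```python
# Sketch: observe the exit code a streaming-style script would return.
import subprocess
import sys


def exit_status(source, stdin_text=""):
    """Run Python source in a child process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        input=stdin_text.encode("utf-8"),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# A well-behaved script: reads stdin and finishes normally.
GOOD = "import sys\nfor line in sys.stdin:\n    print(line.strip())\n"
# A broken script: 'stdin' is undefined, so it raises NameError.
BAD = "for line in stdin:\n    print(line)\n"

print(exit_status(GOOD, "foo bar\n"))  # 0
print(exit_status(BAD, "foo bar\n"))   # 1
```

Setting stream.non.zero.exit.is.failure=false only hides such a status from the framework; when a script exits non-zero because of a bug, fixing the script is the more reliable remedy.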

Now run source run.sh in this directory to start the job. Afterwards, run
hdfs dfs -cat wcoutput/* to see the results.
