Word Frequency Counting with Hive

Security Operations
December 12, 2021

aqzt

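Hive provides HiveQL, a query language similar to SQL. With HiveQL you can express simple MapReduce-style statistics in a few statements: Hive itself translates each HiveQL statement into MapReduce jobs and runs them, so there is no need to develop a dedicated MapReduce application. This makes Hive well suited to statistical analysis in a data warehouse. This post uses a simple word frequency count as a first look at Hive.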

1. Create two text files locally


    cd /usr/local/hadoop/input
    echo "hello world" > file1.txt
    echo "hello hadoop" > file2.txt
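
If /usr/local/hadoop/input does not exist yet, the cd above fails. The following is a minimal sketch of the same step in a scratch directory; the variable name input_dir and the path under /tmp are assumptions made for illustration, not part of the original setup.

```shell
# Hypothetical scratch directory; the article itself uses /usr/local/hadoop/input.
input_dir="${TMPDIR:-/tmp}/wordcount_input"
mkdir -p "$input_dir"

# printf writes exactly one line per file, using plain ASCII quotes.
printf 'hello world\n'  > "$input_dir/file1.txt"
printf 'hello hadoop\n' > "$input_dir/file2.txt"

# Print both files to confirm what would be uploaded in the next step.
cat "$input_dir/file1.txt" "$input_dir/file2.txt"
```

If you use a directory like this, the source path of the upload in the next step has to point at it instead.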

2. Upload the files to HDFS (Hive operates on data stored in the HDFS file system)


    # Run from the Hadoop installation directory, since ./bin/hdfs is a relative path.
    ./bin/hdfs dfs -mkdir -p /wordcount/input
    ./bin/hdfs dfs -put /usr/local/hadoop/input/*.txt /wordcount/input

3. In Hive, run the following HiveQL statements to compute the word counts


    -- The table has a single column of type string.
    create table wordcount(line string);

    -- Load the uploaded data into the wordcount table
    -- (this moves the files out of /wordcount/input into the table's storage).
    load data inpath '/wordcount/input' overwrite into table wordcount;

    -- explode turns the words of each line into rows, so the wordcount table
    -- becomes a derived table w with a single column, word.
    create table word_count as
    select word, count(1) as count from
    (select explode(split(line, ' ')) as word from wordcount) w
    group by word
    order by word;
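
To see what the explode(split(line, ' ')) subquery produces before any grouping, it can help to mimic it locally. This is only a rough shell stand-in under stated assumptions: work_dir and words are hypothetical names, the two sample lines are recreated in a temporary directory, and tr plays the role of split plus explode by putting every space-separated token on its own row.

```shell
# Recreate the two sample lines in a throwaway directory (hypothetical path).
work_dir="$(mktemp -d)"
printf 'hello world\n'  > "$work_dir/file1.txt"
printf 'hello hadoop\n' > "$work_dir/file2.txt"

# Local stand-in for: select explode(split(line, ' ')) as word from wordcount
# tr replaces every space with a newline, so each word ends up on its own row.
words="$(cat "$work_dir/file1.txt" "$work_dir/file2.txt" | tr ' ' '\n')"
printf '%s\n' "$words"

# Clean up the temporary files.
rm -rf "$work_dir"
```

The four printed rows (hello, world, hello, hadoop) correspond to the rows of the derived table w that the outer query then groups and counts.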

4. Query the result


    select * from word_count;
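
For the two sample files, the query should return three rows. As a cross-check, the same counts can be reproduced with a local shell pipeline; this is a sketch that approximates the HiveQL query rather than running Hive, and work_dir and counts are hypothetical names.

```shell
# Recreate the two sample lines in a throwaway directory (hypothetical path).
work_dir="$(mktemp -d)"
printf 'hello world\n'  > "$work_dir/file1.txt"
printf 'hello hadoop\n' > "$work_dir/file2.txt"

# tr            ~ explode(split(line, ' ')): one word per row
# sort | uniq -c ~ group by word with count(1), ordered by word
# awk           ~ select word, count: print the word first, then its count
counts="$(cat "$work_dir"/*.txt | tr ' ' '\n' | LC_ALL=C sort | uniq -c | awk '{print $2 "\t" $1}')"
printf '%s\n' "$counts"

# Clean up the temporary files.
rm -rf "$work_dir"
```

This prints hadoop 1, hello 2 and world 1, one pair per line, which is what the word_count table should contain for this input.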
