1. System Environment
Ubuntu 16.04
VMware
Hadoop 2.7.0
Java 1.8.0_111
master: 192.168.19.128
slave1: 192.168.19.129
slave2: 192.168.19.130
2. Deployment Steps
Pig is installed on the master node here.
1. Download Pig: http://www.apache.org/dyn/closer.cgi/pig
wget http://apache.fayea.com/pig/pig-0.16.0/pig-0.16.0.tar.gz
2. Extract the archive in the appropriate directory.
tar -zxvf pig-0.16.0.tar.gz
3. Configure the Pig environment in .bashrc.
MapReduce mode requires the environment variable PIG_CLASSPATH to point to Hadoop's configuration directory (in Hadoop 2.x this is $HADOOP_HOME/etc/hadoop rather than the old conf/ directory).
# set hadoop classpath
export HADOOP_HOME=/home/hadoop/software/hadoop-2.7.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_PREFIX=$HADOOP_HOME

# set hadoop cluster login username
export HADOOP_USER_NAME=hadoop

############### config pig env ##################
export PIG_HOME=/home/hadoop/software/pig-0.16.0
export PIG_CONF_DIR=$PIG_HOME/conf
export PIG_CLASSPATH=$HADOOP_CONF_DIR
export PATH=$PIG_HOME/bin:$PATH
export CLASSPATH=$CLASSPATH:.:$PIG_HOME/bin
############### end config pig env ##############
4. Start Pig in MapReduce mode.
pig -x mapreduce
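The exports above only take effect in a new shell, so a quick sanity check before going to the cluster might look like the following (a sketch assuming the 0.16.0 install paths used above):

```shell
# Reload the environment so the exports in .bashrc take effect
source ~/.bashrc

# Verify the pig launcher is on PATH and reports the expected release
pig -version

# Optionally smoke-test without the cluster first: local mode reads
# from the local filesystem instead of HDFS
pig -x local
```

Local mode is useful for debugging scripts quickly, since it does not submit MapReduce jobs at all.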
3. Test Case
Run help to view the built-in documentation. Pig supports File system commands, Diagnostic commands, and Utility commands;
for example, fs -mkdir /pig_dir creates a directory in HDFS.
Note: start the job history server before running any Pig script; otherwise the job fails because Pig cannot fetch job status from the history server.
Start the job history server (for configuration details, see: Hadoop YARN JobHistoryServer configuration):
mr-jobhistory-daemon.sh start historyserver
yarn-daemon.sh start timelineserver
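One way to confirm the daemons came up is to list the JVMs on the node; the process names below are the ones Hadoop 2.x uses (adjust if your version differs):

```shell
# JobHistoryServer serves MapReduce job status (default port 10020);
# ApplicationHistoryServer is the YARN timeline server
jps | grep -E 'JobHistoryServer|ApplicationHistoryServer'
```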
Download the test data: ncdc_data.txt
Upload the downloaded file to the target HDFS directory:
wget http://www.blogjava.net/Files/redhatlinux/ncdc_data.txt
hdfs dfs -put ncdc_data.txt /user/hadoop/pig/
Run the Pig script:
A = LOAD '/user/hadoop/pig/ncdc_data.txt' USING PigStorage(':') AS (year:int, temp:int, quality:int);
B = FILTER A BY temp != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
C = GROUP B BY year;
D = FOREACH C GENERATE group, MAX(B.temp) AS max_temp;
describe D;
DUMP D;
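To see what the pipeline computes without a cluster, the same filter-and-aggregate logic can be sketched over a few made-up records in the same year:temp:quality format (the sample values below are illustrative, not taken from ncdc_data.txt):

```shell
# Made-up sample records in the same colon-delimited layout
printf '%s\n' \
  '1950:22:1' \
  '1950:9999:1' \
  '1950:31:4' \
  '1951:10:0' \
  '1951:12:2' |
# FILTER: drop the 9999 sentinel and bad quality codes;
# GROUP + MAX: keep the highest temperature per year
awk -F: '$2 != 9999 && ($3==0||$3==1||$3==4||$3==5||$3==9) {
  if (!($1 in max) || $2 > max[$1]) max[$1] = $2
} END { for (y in max) print y ":" max[y] }' | sort
```

Here 9999 is the dataset's sentinel for a missing reading, and the listed quality codes mark readings treated as trustworthy; everything else is filtered out before the per-year maximum is taken.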
Running the script first prints the schema of D (from describe) and then the (year, max_temp) tuples (from DUMP).
Save the results:
STORE D INTO '/user/hadoop/pig/result_max.txt' USING PigStorage(':');
View the stored results:
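Note that STORE writes a directory named result_max.txt containing one part file per reducer, not a single file, so the rows are read back with a glob over the part files:

```shell
# STORE creates a directory; concatenate its part files to see the rows
hdfs dfs -cat /user/hadoop/pig/result_max.txt/part-*
```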