
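Requirements

A company has two departments, hive and pig, and both need to use the company's Hadoop cluster. Hadoop's default scheduler is FIFO: whoever submits a job first is served first. The hive department therefore worries that a long-running job submitted by the pig department could hold up hive's work. Hive wants the two departments to share the cluster's computing capacity evenly at peak times, so that neither affects the other.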

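Approach

Hadoop's default scheduler is FIFO, but Hadoop also provides a Capacity Scheduler, which solves this problem. Configure three queues in Hadoop: default, hive, and pig, with capacities of 30%, 40%, and 30% respectively. The hive and pig departments use the hive and pig queues, and default is left for other departments or ad hoc use. There is one more requirement: in normal times, when nobody else is using the cluster, the hive or pig department should be able to use 100% of the computing capacity.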

Solution

Edit Hadoop's mapred-site.xml configuration file:


<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>default,hive,pig</value>
</property>

Add the following to capacity-scheduler.xml:


<!-- hive -->
<property>
  <name>mapred.capacity-scheduler.queue.hive.capacity</name>
  <value>40</value>
  <description>Percentage of the number of slots in the cluster that are to be available for jobs in this queue.</description>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.hive.maximum-capacity</name>
  <value>-1</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.hive.supports-priority</name>
  <value>true</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.hive.minimum-user-limit-percent</name>
  <value>100</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.hive.user-limit-factor</name>
  <value>3</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.hive.maximum-initialized-active-tasks</name>
  <value>200000</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.hive.maximum-initialized-active-tasks-per-user</name>
  <value>100000</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.hive.init-accept-jobs-factor</name>
  <value>10</value>
</property>

<!-- pig -->
<property>
  <name>mapred.capacity-scheduler.queue.pig.capacity</name>
  <value>30</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.pig.maximum-capacity</name>
  <value>-1</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.pig.supports-priority</name>
  <value>true</value>
  <description>If true, priorities of jobs will be taken into account in scheduling decisions.</description>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.pig.minimum-user-limit-percent</name>
  <value>100</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.pig.user-limit-factor</name>
  <value>4</value>
  <description>The multiple of the queue capacity which can be configured to allow a single user to acquire more slots.</description>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.pig.maximum-initialized-active-tasks</name>
  <value>200000</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.pig.maximum-initialized-active-tasks-per-user</name>
  <value>100000</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.pig.init-accept-jobs-factor</name>
  <value>10</value>
</property>

<!-- default -->
<property>
  <name>mapred.capacity-scheduler.queue.default.capacity</name>
  <value>30</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.maximum-capacity</name>
  <value>-1</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.supports-priority</name>
  <value>true</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.minimum-user-limit-percent</name>
  <value>100</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.user-limit-factor</name>
  <value>4</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.maximum-initialized-active-tasks</name>
  <value>200000</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.maximum-initialized-active-tasks-per-user</name>
  <value>100000</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.init-accept-jobs-factor</name>
  <value>10</value>
</property>

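This configures three queues: hive, pig, and default. The hive queue's capacity is 40%, set by the property mapred.capacity-scheduler.queue.hive.capacity. The capacities of the other queues are set the same way.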
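Since the three capacity values are meant to add up to 100%, it can be handy to check a capacity-scheduler.xml before rolling it out. The following is only a sketch using the JDK's XML parser. The class name CapacityCheck and its methods are made up for illustration and are not part of Hadoop, and the sample assumes the properties are wrapped in the usual configuration root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a Hadoop class): collects the per-queue
// "capacity" values from capacity-scheduler.xml content and sums them.
final class CapacityCheck {
    private static final String PREFIX = "mapred.capacity-scheduler.queue.";
    private static final String SUFFIX = ".capacity";

    private CapacityCheck() {}

    // Returns queue name -> capacity percent, in file order.
    static Map<String, Integer> readCapacities(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, Integer> capacities = new LinkedHashMap<>();
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                String name = childText(prop, "name");
                String value = childText(prop, "value");
                if (name == null || value == null) {
                    continue;
                }
                // ".maximum-capacity" does not end with ".capacity", so it is skipped.
                if (name.length() > PREFIX.length() + SUFFIX.length()
                        && name.startsWith(PREFIX) && name.endsWith(SUFFIX)) {
                    String queue = name.substring(PREFIX.length(), name.length() - SUFFIX.length());
                    capacities.put(queue, Integer.parseInt(value));
                }
            }
            return capacities;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read capacity settings", e);
        }
    }

    private static String childText(Element parent, String tag) {
        Node node = parent.getElementsByTagName(tag).item(0);
        return node == null ? null : node.getTextContent().trim();
    }

    static int totalCapacity(Map<String, Integer> capacities) {
        int total = 0;
        for (int c : capacities.values()) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>mapred.capacity-scheduler.queue.hive.capacity</name><value>40</value></property>"
                + "<property><name>mapred.capacity-scheduler.queue.pig.capacity</name><value>30</value></property>"
                + "<property><name>mapred.capacity-scheduler.queue.default.capacity</name><value>30</value></property>"
                + "</configuration>";
        Map<String, Integer> capacities = readCapacities(xml);
        System.out.println(capacities + " total=" + totalCapacity(capacities));
    }
}
```

In practice the XML string would be read from the real file. The check only looks at the capacity properties and ignores everything else.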

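The hive, pig, and default queues also need to be able to take over the whole cluster's resources. This is determined by the property mapred.capacity-scheduler.queue.hive.user-limit-factor, with maximum-capacity left at -1 so that the queue has no upper cap. For the hive queue the factor is 3, so a user's resource limit is 40% * 3 = 120%, which means the effective capacity is 100% of the cluster. The maximum cluster capacity of the other queues follows in the same way.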
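The arithmetic can be written down as a small sketch. It models only the calculation described here, not the scheduler's actual code. The class and method names are made up for illustration, and it assumes that a negative maximum-capacity means "no cap".

```java
// Hypothetical helper (not a Hadoop class): models the limit arithmetic only.
final class QueueLimits {
    private QueueLimits() {}

    /**
     * Upper bound, in percent of the cluster, that a single user of a queue can reach.
     *
     * capacityPercent        the queue's "capacity" value, e.g. 40 for hive
     * userLimitFactor        the queue's "user-limit-factor" value, e.g. 3 for hive
     * maximumCapacityPercent the queue's "maximum-capacity" value; a negative
     *                        value is assumed to mean "no cap"
     */
    static double effectiveUserLimitPercent(double capacityPercent,
                                            double userLimitFactor,
                                            double maximumCapacityPercent) {
        // capacity * factor, e.g. 40 * 3 = 120
        double limit = capacityPercent * userLimitFactor;
        // A queue can never use more than the whole cluster.
        limit = Math.min(limit, 100.0);
        // Apply the queue's own cap only when one is configured.
        if (maximumCapacityPercent >= 0) {
            limit = Math.min(limit, maximumCapacityPercent);
        }
        return limit;
    }

    public static void main(String[] args) {
        System.out.println("hive:    " + effectiveUserLimitPercent(40, 3, -1));
        System.out.println("pig:     " + effectiveUserLimitPercent(30, 4, -1));
        System.out.println("default: " + effectiveUserLimitPercent(30, 4, -1));
    }
}
```

With the values from the configuration, all three queues come out at the full cluster, which is the behaviour the two departments asked for when the cluster is otherwise idle.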

How to use the queues

MapReduce: in the job's code, set the queue the job belongs to, for example hive:


conf.setQueueName("hive");

Hive: when running a Hive job, set the queue it belongs to, for example pig:


set mapred.job.queue.name=pig;

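Dynamically updating queues and capacities

In production, changes to queues and their capacities are unavoidable, and restarting the cluster for every change is costly. To change the queue and capacity configuration without a restart:

1. On the master node, edit mapred-site.xml and capacity-scheduler.xml as needed.

2. Sync the configuration to all nodes.

3. As the hadoop user, run: hadoop mradmin -refreshQueues

This updates the cluster's queues and capacities dynamically, with no restart required. Refresh the MapReduce web console to see the result.

Note: if the configuration has not been synced to every node, some queues will fail to come up.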

Finally, copy the capacity-scheduler jar into the lib directory under the Hadoop installation path:

cp contrib/capacity-scheduler/hadoop-capacity-scheduler-0.20.203.0.jar ./lib/

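Restart the JobTracker and you're done!

Note:

A minor drawback of the capacity-scheduler: suppose job A targets queue 1, which is currently busy, and job B targets queue 2, which is currently idle. The capacity-scheduler's slot preemption mechanism assigns job A's MapReduce tasks to queue 2. If queue 2 then receives job B after job A has been placed there, job A has lower priority than job B, so the scheduler gives queue 2's slots to job B first. Job A ends up with very few resources, and even if queue 1 becomes idle some time later, job A cannot be moved back to queue 1.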
