Question

I am using CDH 5.1.0 (Hadoop 2.3.0), with 2 name nodes (32 GB RAM and 2 cores each) and 3 data nodes (16 GB RAM and 2 cores each).

I am scheduling MapReduce jobs from a single user in the default queue (there are no other users and no other queues configured).

With the capacity scheduler, the following happens: I can submit multiple jobs, but only 2 jobs execute at the same time (status "RUNNING").

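With the fair scheduler, the following happens: I submit multiple jobs, and the cluster/scheduler sets 4 jobs to the "RUNNING" state. These jobs stay at 5% progress forever. If a single job is killed, a new job is set to "RUNNING" at 5%, again with no further progress. Jobs only start to execute once fewer than 4 jobs are left and no additional jobs are submitted to the queue.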

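I have reconfigured the cluster multiple times, but I was never able to increase the number of running jobs with the capacity scheduler, or to avoid hanging jobs with the fair scheduler.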

My question is: how do I configure the cluster / YARN / scheduler / dynamic and static resource pools to make scheduling work?

Here are some of the configuration parameters:

yarn.scheduler.minimum-allocation-mb = 2GB
yarn.scheduler.maximum-allocation-mb = 12GB
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 2
yarn.nodemanager.resource.memory-mb = 12GB
yarn.nodemanager.resource.cpu-vcores  = 2
mapreduce.map.memory.mb = 12GB
mapreduce.reduce.memory.mb = 12GB
mapreduce.map.java.opts.max.heap = 9.6GB
mapreduce.reduce.java.opts.max.heap = 9.6GB
yarn.app.mapreduce.am.resource.mb = 12GB
ApplicationMaster Java Maximum Heap Size = 788MB
mapreduce.task.io.sort.mb = 1GB

I have left the static and dynamic resource pools at their default (Cloudera) settings (e.g. the Max Running Apps setting is empty).

Answer 1

Not a solution, but a possible workaround.

At some point we discussed this issue with Christian Neundorf from MapR consulting, and he claimed that there is a deadlock bug in the FairScheduler (not CDH-specific, but in standard Hadoop!).

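He proposed this solution, but I can't remember whether we tried it. Use it at your own risk; I give no guarantee that it actually works, and I am posting it only for those who are really desperate and willing to try anything to get their application running: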

In yarn-site.xml (I don't know why this has to be set):

<property>
    <name>yarn.scheduler.fair.user-as-default-queue</name>
    <value>false</value>
    <description>Disable username for default queue </description>
</property>

In fair-scheduler.xml:

<allocations>
    <queue name="default">
        <!-- set an integer value here: the number of cores at your disposal minus one (or more) -->
        <maxRunningApps>number of cores - 1</maxRunningApps>
    </queue>
</allocations>

Answer 2

Reduce these parameters:

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
yarn.app.mapreduce.am.resource.mb

to 6 GB (and reduce the heap sizes accordingly).
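As a rough sketch of that change, the fragment below shows how the three memory settings might look in mapred-site.xml. The heap values are an assumption: they keep the question's original heap-to-container ratio (9.6 GB of 12 GB, i.e. 0.8) and use the plain Hadoop property names mapreduce.map.java.opts and mapreduce.reduce.java.opts rather than the Cloudera Manager fields listed in the question. On a Cloudera Manager cluster these values would normally be changed through the manager UI rather than by editing the file directly.

```xml
<!-- mapred-site.xml: illustrative values only, not a tested configuration -->
<configuration>
  <!-- 6 GB per map container (was 12 GB) -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>6144</value>
  </property>
  <!-- 6 GB per reduce container (was 12 GB) -->
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>6144</value>
  </property>
  <!-- 6 GB for the ApplicationMaster container (was 12 GB) -->
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>6144</value>
  </property>
  <!-- Assumed heap sizes: 0.8 * 6144 MB, rounded down -->
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx4915m</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx4915m</value>
  </property>
</configuration>
```

With 6 GB containers, each 12 GB NodeManager has room for two containers, so the three data nodes could hold six containers in total instead of three.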

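With the current configuration you can run only three containers (one per node): each node offers 12 GB to YARN (yarn.nodemanager.resource.memory-mb), and every map, reduce, and ApplicationMaster container requests the full 12 GB.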

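A YARN job needs at least two containers to run (one for the ApplicationMaster and another for a Map or Reduce task). So when you launch three ApplicationMasters for three different jobs, you can easily end up in a situation where nothing makes progress, because no containers are left to do the actual Map/Reduce processing.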

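Also, you should limit the number of applications that can run in parallel on your cluster to 2 or 3, since you don't have that many resources.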
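For the fair scheduler, such a limit could be sketched in the allocation file as below. The value 2 is simply the lower end of the suggested range, and the fragment assumes every job lands in the queue named default, as described in the question. On a Cloudera Manager cluster, the Max Running Apps field of the resource pool mentioned in the question would be the place to set the same limit.

```xml
<!-- fair-scheduler.xml: illustrative limit, assuming a single "default" queue -->
<allocations>
  <queue name="default">
    <!-- At most two applications run concurrently; further jobs wait in the queue -->
    <maxRunningApps>2</maxRunningApps>
  </queue>
</allocations>
```

Capping the number of running applications keeps ApplicationMasters from occupying every container, so at least some capacity stays free for the Map and Reduce tasks.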
