What is the ideal number of reducers on Hadoop?

Popularity: 137 · Published: 2022-10-16 · Tags: hadoop mapreduce reducers

Problem description

According to the Hadoop wiki, the ideal number of reducers is 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum).
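As a quick sanity check, the formula can be evaluated for a hypothetical cluster (the node count and slots-per-node figures below are made-up values, not taken from the question):

```python
# Illustrative arithmetic for the Hadoop wiki formula. The cluster sizes are
# hypothetical; "slots_per_node" stands in for mapred.tasktracker.tasks.maximum.

def ideal_reducers(nodes: int, slots_per_node: int, factor: float) -> int:
    """Number of reduce tasks = factor * (nodes * slots_per_node)."""
    return round(factor * nodes * slots_per_node)

# A 50-node cluster with 2 reduce slots per node -> 100 reduce slots total.
print(ideal_reducers(50, 2, 0.95))  # 95:  one wave, all tasks start at once
print(ideal_reducers(50, 2, 1.75))  # 175: roughly two waves of tasks
```

With 0.95 the task count stays just under the slot count, so every task starts immediately; with 1.75 the extra tasks queue up and run as slots free, which is the trade-off discussed in the answer below.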

But when should one choose 0.95 and when 1.75? What factors go into choosing this multiplier?

Accepted answer

Let's say that you have 100 reduce slots available in your cluster.

With a load factor of 0.95, all 95 reduce tasks start at the same time, since there are enough reduce slots available for all of them. This means no task waits in the queue for another task to finish. I would recommend this option when the reduce tasks are "small", i.e., they finish relatively fast, or they all require roughly the same amount of time.

On the other hand, with a load factor of 1.75, 100 reduce tasks start at the same time (as many as there are reduce slots), and the remaining 75 wait in the queue until a reduce slot becomes available. This offers better load balancing: if some tasks are "heavier" than others, i.e., require more time, they will not become the bottleneck of the job, because the other reduce slots, instead of finishing their tasks and sitting idle, will be executing the tasks in the queue. This also lightens the load of each reduce task, since the map output is spread across more tasks.
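The load-balancing effect can be illustrated with a toy scheduling sketch (plain Python, not Hadoop code; the slot count and task durations are invented to show skew, with the same total work in both cases):

```python
# Toy simulation of reduce tasks of uneven length filling a fixed number of
# reduce slots. With one big wave, a single slow task stalls the whole job;
# with more, smaller tasks, idle slots pick up queued work instead of waiting.
import heapq

def job_finish_time(task_durations, slots):
    """Greedy earliest-available-slot scheduling; returns total wall time."""
    free_at = [0.0] * slots          # time at which each slot next becomes free
    heapq.heapify(free_at)
    for d in task_durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)

slots = 4
# One wave: 4 tasks fill all slots, but skew makes one task much heavier.
one_wave = [4.0, 4.0, 4.0, 28.0]
# Two waves: the same 40 units of work split into 8 smaller tasks, same skew.
two_waves = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 14.0, 14.0]

print(job_finish_time(one_wave, slots))   # 28.0: the heavy task dominates
print(job_finish_time(two_waves, slots))  # 16.0: queued tasks keep slots busy
```

Splitting the same skewed work into about twice as many tasks lets the fast slots drain the queue while the heavy tasks run, which is exactly the argument for the 1.75 factor above.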

If I may express my opinion, I am not sure these factors are always ideal. I often use a factor greater than 1.75 (sometimes even 4 or 5), since I deal with Big Data and the data does not fit on each machine unless I set this factor higher; load balancing also improves.
