What does the shuffling phase actually do?

Views: 736 | Published: 2022-10-16 | Tags: shuffle hadoop mapreduce mapper reducers

Question

What does the shuffling phase actually do?

Possibility - A

Since shuffling is the process of bringing the mapper output to the reducer input, it just brings specific keys from the mappers to particular reducers, based on the code written in the partitioner.

E.g. the output of mapper 1 is {a,1} {b,1}

and the output of mapper 2 is {a,1} {b,1}

and in my partitioner I have written that all keys starting with 'a' go to reducer 1 and all keys starting with 'b' go to reducer 2, so the output would be:

reducer 1: {a,1}{a,1}

reducer 2: {b,1}{b,1}
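
The routing in this example can be imitated in a few lines of plain Python. This is only a sketch of the idea, not Hadoop code: `first_letter_partitioner` and `route_to_reducers` are hypothetical names, and reducers are numbered from 0 here, so index 0 plays the role of "reducer 1" and index 1 the role of "reducer 2".

```python
from collections import defaultdict


def first_letter_partitioner(key, num_reducers):
    # Assumed rule from the example: keys starting with 'a' map to index 0,
    # keys starting with 'b' map to index 1.
    return (ord(key[0]) - ord("a")) % num_reducers


def route_to_reducers(mapper_outputs, num_reducers, partitioner):
    # Send every (key, value) pair to the reducer chosen by the partitioner.
    # No sorting or grouping happens in this step.
    reducer_inputs = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            reducer_inputs[partitioner(key, num_reducers)].append((key, value))
    return dict(reducer_inputs)


mapper_outputs = [
    [("a", 1), ("b", 1)],  # output of mapper 1
    [("a", 1), ("b", 1)],  # output of mapper 2
]
routed = route_to_reducers(mapper_outputs, 2, first_letter_partitioner)
print(routed)
# {0: [('a', 1), ('a', 1)], 1: [('b', 1), ('b', 1)]}
```

With partitioning alone, each reducer receives a flat list of pairs, which is the shape described in possibility A.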

Possibility - B

Or, along with the above process, does it also group the keys?

So the output would be:

reducer 1: {a,[1,1]}

reducer 2: {b,[1,1]}

In my opinion it should be A, because grouping of keys must take place after sorting: sorting is done only so that the reducer can easily tell where one key ends and the next one starts. If so, when does the grouping of keys actually happen? Please elaborate.
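
The reasoning about sorting can be illustrated with a small Python sketch: once the pairs are sorted by key, equal keys sit next to each other, so a single pass can detect where one key ends and the next begins. The function name `group_sorted_pairs` and the sample keys are made up for illustration, and this only models the idea, not how Hadoop implements it.

```python
from itertools import groupby
from operator import itemgetter


def group_sorted_pairs(pairs):
    # Sort by key so that identical keys become adjacent.
    ordered = sorted(pairs, key=itemgetter(0))
    # groupby starts a new group whenever the key changes, which only
    # produces one group per key because the input is sorted.
    return [
        (key, [value for _, value in run])
        for key, run in groupby(ordered, key=itemgetter(0))
    ]


# Hypothetical input of one reducer that receives several keys starting with 'a'.
reducer_input = [("apple", 1), ("avocado", 1), ("apple", 1)]
grouped = group_sorted_pairs(reducer_input)
print(grouped)
# [('apple', [1, 1]), ('avocado', [1])]
```

Without the sort, `groupby` would emit `apple` twice, which is why grouping in this sketch has to come after sorting.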

Recommended answer

Mappers and reducers are not separate machines, just separate code. Both the mapping code and the reducing code run on the same set of machines in the cluster.

So, after all machines in the cluster have run the mapper, the results are:

1. binned locally on the node (think of it as "local grouping"); and
2. shuffled/redistributed across all nodes in the cluster.

Consider step 2 a "global grouping", because it is done in such a way that all values belonging to one key go to the single node assigned to that key.

Now the nodes run the reducer code on the (key, value) pairs residing in their memory.
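
The steps described in this answer can be walked through end to end in a short Python simulation. It is a toy model under stated assumptions, not Hadoop's implementation: the names `local_group`, `global_group`, and `reduce_sum` are hypothetical, the first-letter rule for assigning keys to nodes is taken from the question's example, and the summing reducer is an assumed word-count-style reducer.

```python
from collections import defaultdict


def local_group(pairs):
    # Step 1: bin one node's mapper output by key ("local grouping").
    bins = defaultdict(list)
    for key, value in pairs:
        bins[key].append(value)
    return dict(bins)


def global_group(local_bins_per_node, num_nodes):
    # Step 2: move all values of a key to the single node assigned to it.
    per_node = [defaultdict(list) for _ in range(num_nodes)]
    for bins in local_bins_per_node:
        for key, values in bins.items():
            # Assumed assignment rule: first letter of the key picks the node.
            target = (ord(key[0]) - ord("a")) % num_nodes
            per_node[target][key].extend(values)
    return [dict(node) for node in per_node]


def reduce_sum(key, values):
    # Assumed reducer: add up all values collected for one key.
    return key, sum(values)


node_outputs = [
    [("a", 1), ("b", 1)],  # mapper output on node 0
    [("a", 1), ("b", 1)],  # mapper output on node 1
]
local = [local_group(pairs) for pairs in node_outputs]
assigned = global_group(local, 2)
results = [[reduce_sum(k, v) for k, v in sorted(node.items())] for node in assigned]

print(assigned)  # [{'a': [1, 1]}, {'b': [1, 1]}]
print(results)   # [[('a', 2)], [('b', 2)]]
```

In this model the reducer sees each key together with the full list of its values, which matches the shape in possibility B of the question.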
