Clarification regarding the MapReduce word count example?

Time: 2015-05-18 17:28:25

Tags: hadoop mapreduce

I am studying MapReduce, and I have a question about the basic word count example. Say my text is:

My name is X Y X.

Here is the map class I am referring to:

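import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Nested inside the enclosing job class.
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Reused for every record so no new objects are allocated per token.
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Called once per input line; emits (token, 1) for each token.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}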

When the text is processed by this map job, it produces the following (ignoring the trailing period):

My 1
name 1
is 1
X 1
Y 1
X 1     

Then, after the shuffle and sort, all records with the same key are grouped together and their counts can be added up for the final total. In this example, the two X records will be added together.

My question is: what if I do the addition in the map job itself, by keeping a map from each word to its count, then iterating over that map and writing the counts to the output? Will that affect the MapReduce job? The output will still be the same, but will it be more efficient, since there will be fewer entries for the shuffle, sort, and reducer to operate on?

Is my thinking of doing the addition in the map job correct?

1 Answer:

Answer 0: (score: 1)

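Yes, you should keep your map output as small as possible. Doing a preliminary count reduces the amount of data that passes through the system. Note that you still need a reduce job to add up the counts for each word: your input could be split at Y, so the two "X" words would go to different mappers.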
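The pre-counting idea from the question can be sketched in plain Java, without any Hadoop dependency. This is only an illustration: the class and method names (`InMapperCombiningSketch`, `emitPerToken`, `emitLocalCounts`) are hypothetical, and the returned collections stand in for what a mapper would pass to `output.collect`. The sample text drops the trailing period so the tokens match the output shown in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for a mapper; not part of the Hadoop API.
class InMapperCombiningSketch {

    // Baseline: one "word<TAB>1" record per token, like MapClass above.
    static List<String> emitPerToken(String line) {
        List<String> records = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            records.add(itr.nextToken() + "\t1");
        }
        return records;
    }

    // Pre-counting: sum counts in memory, then emit one record per distinct word.
    static Map<String, Integer> emitLocalCounts(String line) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "My name is X Y X";
        System.out.println("records per token:   " + emitPerToken(text).size());
        System.out.println("records pre-counted: " + emitLocalCounts(text).size());
    }
}
```

For this input the per-token version emits six records and the pre-counted version emits five, with X carrying a count of 2. In a real mapper written against the `org.apache.hadoop.mapred` API, the accumulated counts would probably have to be emitted from `close()` using an `OutputCollector` reference saved during `map`, and the in-memory map has to fit in the mapper's heap.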

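In addition, another efficient thing you can do for MapReduce jobs is to use Combiners. These are reduce steps that run on the mapper node right after the map step finishes, so you can shrink the map output even further.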
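To illustrate why the reduce step is still needed even when counts are summed on the mapper side, here is a minimal plain-Java simulation. It assumes the input happens to be split into "My name is X" and "Y X"; the class and method names (`CombinerSketch`, `mapAndCombine`, `reduce`) are hypothetical and are not Hadoop classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical simulation of map + combine per split, then a final reduce.
class CombinerSketch {

    // Stands in for the map step plus a combiner running on one split:
    // counts are summed locally before anything is shuffled.
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> partial = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            partial.merge(itr.nextToken(), 1, Integer::sum);
        }
        return partial;
    }

    // Stands in for the reducer: adds up the partial counts from every split.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> entry : partial.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Assumed split point: the two X tokens land in different splits.
        List<Map<String, Integer>> partials = Arrays.asList(
                mapAndCombine("My name is X"),
                mapAndCombine("Y X"));
        System.out.println(reduce(partials));
    }
}
```

Each split reports X with a count of 1, and only the final reduce brings the total to 2. In the `org.apache.hadoop.mapred` API used in the question, a combiner is registered with `JobConf.setCombinerClass`, and for word count the reducer class itself is commonly reused as the combiner because summing is associative and commutative. Hadoop treats the combiner as an optimization that may run zero or more times, so the job should not depend on it for correctness.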