Defining one Writable for the whole Mapper/Reducer

Asked: 2015-11-12 10:18:31

Tags: hadoop

I read somewhere that we should define the output Writable once, when the Mapper/Reducer is created, and inside the Mapper/Reducer only set its value, rather than creating a new Writable for every output record.

For example (pseudocode):

IntWritable idWritable = new IntWritable();

map(){
     idWritable.set(outputValue);
     emit(idWritable);
}

is superior to:

map(){
     IntWritable idWritable = new IntWritable(outputValue);
     emit(idWritable);
}

Is this true? Is it really good practice to define the output Writable once, when the Mapper/Reducer is created, and reuse it for all output records?

1 Answer:

Answer 0 (score: 1)

Yes, this is true. In your second example you're creating a brand-new IntWritable every time you process a record. That incurs the overhead of a fresh memory allocation, and it also means the old IntWritable has to be garbage collected at some point. If you're processing millions of records with a complex Writable (say, one holding several ints and Strings), the heap can fill up very quickly.
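As a sketch of the kind of compound Writable described above (a hypothetical type; the name EventWritable and its fields are invented for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical compound Writable; allocating one of these per record
// puts far more pressure on the heap than a single IntWritable would.
public class EventWritable implements Writable {

    private int userId;
    private int eventType;
    private String payload = "";

    // Re-set all fields in place instead of constructing a new instance.
    public void set(int userId, int eventType, String payload) {
        this.userId = userId;
        this.eventType = eventType;
        this.payload = payload;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(userId);
        out.writeInt(eventType);
        out.writeUTF(payload);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readInt();
        eventType = in.readInt();
        payload = in.readUTF();
    }
}

With a set method like this, a single instance can be refilled for every record instead of allocated anew.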

Alternatively, by just re-setting the value within the same object, no new memory needs to be allocated and no garbage collection needs to take place. It's much faster, but I'd recommend running your own experiments to confirm this.
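To make the pattern concrete, here is a minimal word-count-style Mapper using the Hadoop Java API; the class name TokenCountMapper and the tokenizing logic are illustrative, not from the question:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Both output Writables are fields: created once per Mapper instance,
// only their values change from record to record.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);            // re-set the value, no new allocation
            context.write(word, one);   // key and value are serialized here
        }
    }
}

Reusing the objects this way is safe because the framework serializes the key and value as soon as context.write is called, so overwriting them on the next record cannot corrupt earlier output.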