MapReduce reducer counts are wrong

Date: 2015-03-17 18:04:11

Tags: hadoop mapreduce

I am writing a simple MapReduce program that counts how many times each row appears in the input. My goal is to check whether two directories contain the same data, so in the reduce phase I check that each key appears exactly twice (once from each input directory).

Here is my code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ResultsValidator extends Configured implements Tool {

    public static class TuplesScanner extends Mapper<BytesWritable, NullWritable, BytesWritable, LongWritable> {

        private LongWritable one = new LongWritable(1);

        @Override
        public void map(BytesWritable row, NullWritable ignored, Context context) throws IOException, InterruptedException {
            context.write(row, one);
        }
    }

    public static class TuplesCombiner extends Reducer<BytesWritable, LongWritable, BytesWritable, LongWritable> {

        @Override 
        public void reduce(BytesWritable row, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            context.write(row, new LongWritable(sum));
        }
    }

    public static class TuplesReducer extends Reducer<BytesWritable, LongWritable, BytesWritable, NullWritable> {

        @Override 
        public void reduce(BytesWritable row, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            if (sum != 2) {
                context.write(row, NullWritable.get());
            }
        }
    }

    @Override
    public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Job job = Job.getInstance(getConf());

        Path inputDir0 = new Path(args[0]);
        Path inputDir1 = new Path(args[1]);
        Path outputDir = new Path(args[2]);
        int reducersNum = Integer.parseInt(args[3]);
        if (outputDir.getFileSystem(getConf()).exists(outputDir)) {
          throw new IOException("Output directory " + outputDir + 
                                " already exists.");
        }
        FileInputFormat.addInputPath(job, inputDir0);
        FileInputFormat.addInputPath(job, inputDir1);
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setJobName("ResultsValidator");
        job.setJarByClass(ResultsValidator.class);
        job.setMapperClass(TuplesScanner.class);
        job.setCombinerClass(TuplesCombiner.class);
        job.setReducerClass(TuplesReducer.class);
        job.setNumReduceTasks(reducersNum);
        job.setMapOutputKeyClass(BytesWritable.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(BytesWritable.class);
        job.setOutputValueClass(NullWritable.class);
        job.setInputFormatClass(ResultsValidatorInputFormat.class);
        job.setOutputFormatClass(ResultsValidatorOutputFormat.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new ResultsValidator(), args);
        System.exit(res);
    }
}
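The duplicate check I am aiming for can be sketched in plain Java without Hadoop (the class name and the in-memory rows here are hypothetical, just to show the intended reduce-phase behavior): every row is mapped to a count of 1, counts are summed per key, and only keys whose total is not exactly 2 are emitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateCheckSketch {
    // Simulates the map + reduce phases on in-memory rows:
    // each row contributes (row, 1), counts are summed per key,
    // and only keys whose total is not exactly 2 are reported.
    static List<String> mismatches(List<String> dir0, List<String> dir1) {
        Map<String, Long> counts = new HashMap<>();
        for (String row : dir0) counts.merge(row, 1L, Long::sum);
        for (String row : dir1) counts.merge(row, 1L, Long::sum);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() != 2L) out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> dir0 = List.of("a", "b", "c");
        List<String> dir1 = List.of("a", "b", "d");
        // "c" and "d" each appear only once, so they are flagged
        System.out.println(mismatches(dir0, dir1));
    }
}
```

Note that this mirrors the MapReduce approach exactly, including its assumption that rows are unique within each input directory; a row duplicated within one directory and absent from the other would still sum to 2 and go unreported.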

I cannot figure out why I am getting the wrong numbers in the iterable during the reduce phase. In the logs, I found that the number each reducer receives equals the number of merged shuffles.

Where am I going wrong?

0 Answers