Using ChainMapper

Date: 2016-03-11 07:14:07

Tags: hadoop mapreduce hadoop2 hadoop-partitioning bigdata

I have a ChainMapper with 2 mappers associated with it. I am trying to perform a TotalOrderPartition on the last mapper in the chain, without much success.

Is there a way to enforce the partitioning based on some sampling of the N-th mapper's output in the chain?

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountChain extends Configured implements Tool
{
    @Override
    public int run(String[] args) throws Exception
    {
        Job job = new Job(getConf(), "Word Count V1 (Chain)");
        job.setJarByClass(getClass());

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        /*********** First Mapper ***********/
        Configuration wcpMapperConf = new Configuration(false);
        ChainMapper.addMapper(job, WordCountPreparationMapper.class, LongWritable.class, Text.class, Text.class, IntWritable.class, wcpMapperConf);

        /*********** Second Mapper ***********/
        Configuration wcMapperConf = new Configuration(false);
        ChainMapper.addMapper(job, Mapper.class, Text.class, IntWritable.class, Text.class, IntWritable.class, wcMapperConf);

        /******* This enforces the Sampling/Partitioning over the First Mapper *******/
        //job.setInputFormatClass(SequenceFileInputFormat.class);
        //InputSampler.Sampler<Text, IntWritable> sampler = new InputSampler.RandomSampler<Text, IntWritable>(0.1, 10000, 10);
        //InputSampler.writePartitionFile(job, sampler);
        //job.addCacheFile( new URI( TotalOrderPartitioner.getPartitionFile(getConf()) ) );

        job.setNumReduceTasks(10);
        job.setReducerClass(WordCountReducer.class);
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception
    {
        int exitCode = ToolRunner.run(new WordCountChain(), args);
        System.exit(exitCode);
    }
}

1 Answer:

Answer (score: 1):

Unfortunately, the RandomSampler does not run while the job is running; in fact, it runs before the job even starts, at the moment you call

InputSampler.writePartitionFile(job, sampler);

This means it does not run on the output of any Mapper, but on the job's input dataset.

If you need to partition based on the output of the N-th Mapper, you can split your job into two jobs: a map-only job and a MapReduce job. The first one runs the chain of mappers up to the N-th Mapper and simply stores its output. The second job samples and partitions based on its input (which will be the output of the N-th Mapper), and then runs the remaining Mappers and the Reducer.
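A rough sketch of that two-job split, written as the body of a `run(String[] args)` driver method (untested; the intermediate path, the single-mapper chain shown, and the class names `WordCountPreparationMapper`/`WordCountReducer` are carried over from the question, everything else is illustrative):

// Job 1: map-only, runs the chain up to the N-th mapper and stores its output.
Path intermediate = new Path(args[1] + "_intermediate");

Job job1 = Job.getInstance(getConf(), "Chain up to Nth mapper (map-only)");
job1.setJarByClass(getClass());
FileInputFormat.addInputPath(job1, new Path(args[0]));
FileOutputFormat.setOutputPath(job1, intermediate);
ChainMapper.addMapper(job1, WordCountPreparationMapper.class,
        LongWritable.class, Text.class, Text.class, IntWritable.class,
        new Configuration(false));
job1.setOutputFormatClass(SequenceFileOutputFormat.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(IntWritable.class);
job1.setNumReduceTasks(0);                       // map-only: no reducers
if (!job1.waitForCompletion(true)) return 1;

// Job 2: its input IS the N-th mapper's output, so the sampler sees that data.
Job job2 = Job.getInstance(getConf(), "Sample, partition and reduce");
job2.setJarByClass(getClass());
job2.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.addInputPath(job2, intermediate);
FileOutputFormat.setOutputPath(job2, new Path(args[1]));
job2.setMapperClass(Mapper.class);               // remaining mappers (identity here)
job2.setMapOutputKeyClass(Text.class);
job2.setMapOutputValueClass(IntWritable.class);
job2.setReducerClass(WordCountReducer.class);
job2.setNumReduceTasks(10);
job2.setPartitionerClass(TotalOrderPartitioner.class);

// Sampling now happens over the intermediate data, i.e. the N-th mapper's output.
InputSampler.Sampler<Text, IntWritable> sampler =
        new InputSampler.RandomSampler<Text, IntWritable>(0.1, 10000, 10);
InputSampler.writePartitionFile(job2, sampler);
job2.addCacheFile(new URI(TotalOrderPartitioner.getPartitionFile(job2.getConfiguration())));

return job2.waitForCompletion(true) ? 0 : 1;

The key point is the order of operations in job 2: the partition file is written after the input path is set (so the sampler reads the intermediate SequenceFiles) and before the job is submitted.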