Question

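I am planning to insert some code into the mapper of the TeraSort class in Hadoop 0.20.2. However, after reviewing the source code, I cannot locate the segment where the mapper is implemented. Normally we would see a call such as job.setMapperClass(), which indicates the mapper class. For TeraSort, however, I can only see things like setInputFormat and setOutputFormat. I cannot find where the map and reduce methods are called. Can anyone give me some hints? Thanks. The source code looks like this: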

public int run(String[] args) throws Exception {
   LOG.info("starting");
   JobConf job = (JobConf) getConf();
   Path inputDir = new Path(args[0]);
   inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
   Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
   URI partitionUri = new URI(partitionFile.toString() +
                           "#" + TeraInputFormat.PARTITION_FILENAME);
   TeraInputFormat.setInputPaths(job, new Path(args[0]));
   FileOutputFormat.setOutputPath(job, new Path(args[1]));
   job.setJobName("TeraSort");
   job.setJarByClass(TeraSort.class);
   job.setOutputKeyClass(Text.class);
   job.setOutputValueClass(Text.class);
   job.setInputFormat(TeraInputFormat.class);
   job.setOutputFormat(TeraOutputFormat.class);
   job.setPartitionerClass(TotalOrderPartitioner.class);
   TeraInputFormat.writePartitionFile(job, partitionFile);
   DistributedCache.addCacheFile(partitionUri, job);
   DistributedCache.createSymlink(job);
   job.setInt("dfs.replication", 1);
   // TeraOutputFormat.setFinalSync(job, true);
   job.setNumReduceTasks(0);
   JobClient.runJob(job);
   LOG.info("done");
   return 0;
 }

For other classes, such as TeraValidate, we can find code like this:

job.setMapperClass(ValidateMapper.class);
job.setReducerClass(ValidateReducer.class);

I cannot see such calls in TeraSort.

Thanks,

Answer 1

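Why would a sort need to set a Mapper and Reducer class?

The default is the standard Mapper (formerly the identity mapper) and the standard Reducer. These are the classes you usually inherit from. With the old JobConf API used in your code, that means IdentityMapper and IdentityReducer are used when nothing else is set.

You can basically say that you are just emitting everything from the input and letting Hadoop do its own sorting. So the sort works "by default".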
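To make the "emit everything and let the framework sort" idea concrete, here is a minimal plain-Java sketch that does not use Hadoop at all. The class name IdentitySortSketch and both method names are made up for illustration; identityMap stands in for the default identity map step and shuffleSort stands in for the sort Hadoop performs between map and reduce. In the real job, custom per-record logic would live in your own mapper class registered with job.setMapperClass(...), as TeraValidate does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of an identity-map job; not Hadoop code.
class IdentitySortSketch {

    // Models the identity map step: every (key, value) record is emitted unchanged.
    static List<String[]> identityMap(List<String[]> input) {
        List<String[]> out = new ArrayList<>();
        for (String[] record : input) {
            // Custom per-record logic would be inserted here in a real mapper.
            out.add(new String[] { record[0], record[1] });
        }
        return out;
    }

    // Models the framework's sort between map and reduce: order records by key.
    static List<String[]> shuffleSort(List<String[]> mapOutput) {
        List<String[]> sorted = new ArrayList<>(mapOutput);
        sorted.sort((a, b) -> a[0].compareTo(b[0]));
        return sorted;
    }

    public static void main(String[] args) {
        List<String[]> input = new ArrayList<>();
        input.add(new String[] { "pear", "row-1" });
        input.add(new String[] { "apple", "row-2" });
        input.add(new String[] { "mango", "row-3" });

        // No user-written map or reduce logic, yet the output comes out sorted by key.
        for (String[] record : shuffleSort(identityMap(input))) {
            System.out.println(record[0] + "\t" + record[1]);
        }
    }
}
```

The point of the sketch is only that the ordering comes from the step between map and reduce, not from anything the mapper does.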

Answer 2

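Thomas's answer is right: the mapper and reducer are identity, because the shuffled data is sorted before the reduce function is applied. What is special about TeraSort is its custom partitioner (it is not the default hash partitioner). You can read more about it in Hadoop's implementation for Terasort. It states:

"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N-1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i-1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."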
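The range rule in that quote can be sketched in plain Java, without Hadoop. This is an illustration under stated assumptions, not the code of TotalOrderPartitioner: the class name RangePartitionSketch and the method getPartition are hypothetical, keys are modeled as String instead of Hadoop's Text (so comparison order may differ from byte-wise comparison), the split points are assumed sorted and distinct, and the lookup is a simple binary search rather than whatever structure the real partitioner uses.

```java
import java.util.Arrays;

// Hypothetical stand-alone model of range partitioning; not Hadoop code.
class RangePartitionSketch {

    // splitPoints holds the N-1 sorted sampled keys for N reduces.
    // Returns reduce index i such that splitPoints[i-1] <= key < splitPoints[i],
    // with reduce 0 taking keys below the first split point and
    // reduce N-1 taking keys at or above the last one.
    static int getPartition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        if (pos >= 0) {
            // Key equals splitPoints[pos], so it belongs to the range starting there.
            return pos + 1;
        }
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point equals the number of split points smaller than key.
        return -pos - 1;
    }

    public static void main(String[] args) {
        String[] splitPoints = { "g", "n", "t" }; // 3 split points, so 4 reduces
        for (String key : new String[] { "apple", "g", "horse", "zebra" }) {
            System.out.println(key + " -> reduce " + getPartition(key, splitPoints));
        }
    }
}
```

Because every key sent to reduce i is smaller than every key sent to reduce i+1, concatenating the sorted outputs of the reduces in index order gives a globally sorted result.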
