How to combine two mappers into one reducer

Date: 2016-04-25 18:01:39

Tags: java hadoop mapreduce key-value

I am using Hadoop to compare two files. I use two mappers, one per input file, and a single reducer. The first map gets an ordinary text file, and the second map gets a file in which every line has this format:

word 1 or -1
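For example, a couple of lines of that second file might look like this (made-up sample data, not from the question):

car 1
house -1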

The maps' input is:

public void map(LongWritable key, Text value, Context context) 

The first map's output will be:

key:word value:0

and the second mapper's output will be:

word 1 or -1

The reducer's input is:

public void reduce(Text key, Iterable<IntWritable> values, Context context) 

The reducer's output is:

context.write(key, new IntWritable(sum));

The results I get come from each map separately, but I want the reducer to receive the same key/value pairs from both maps and merge them into a single result. Here is the code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CompareTwoFiles extends Configured implements Tool {

    static ArabicStemmer Stemmer = new ArabicStemmer();

    // First mapper: stems every word of the plain text file and emits (word, 0).
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                Stemmer.stemWord(token);
                word.set(token);
                context.write(word, new IntWritable(0));
            }
        }
    }

    // Second mapper: reads the "word 1 or -1" file and emits (word, 1) or (word, -1).
    public static class Map2 extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString(); // this declaration was missing from the posted code
            int n = 0;
            if (line.contains("1") && !line.contains("-1")) {
                n = 1;
            } else if (line.contains("-1")) {
                n = -1;
            }
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                if (!token.equals("1") && !token.equals("-1")) {
                    word.set(token);
                    context.write(word, new IntWritable(n));
                }
            }
        }
    }

    // Reducer: sums every value seen for a word, whichever mapper it came from.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            // Careful: because keys with sum == 0 are dropped here, reusing this
            // class as a combiner can discard partial sums before the final reduce.
            if (sum != 0) {
                context.write(key, new IntWritable(sum));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new CompareTwoFiles(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        conf.set("hadoop.job.ugi", "hdfs");
        Job job = Job.getInstance(conf);
        job.setJarByClass(CompareTwoFiles.class);
        job.setJobName("compare");
        job.setReducerClass(Reduce.class);
        job.setCombinerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // MultipleInputs assigns a mapper to each input path, so no
        // job.setMapperClass(...) call is needed (it would be overridden anyway).
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, Map.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, Map2.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
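For reference, a job like this is typically launched with three path arguments; the jar name and HDFS paths below are placeholders, not from the question:

hadoop jar compare.jar CompareTwoFiles /user/me/plain.txt /user/me/words.txt /user/me/out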

The result I get is like this:

First map:
w1 0
w2 0

Second map:
w1 1
w2 3
w3 -1

1 answer:

Answer 0 (score: 0)

The whole concept of MapReduce is that the Mapper emits a value for each key (in your case, one value per word), and that there is then one Reducer per key (in your case, one Reducer should receive all the counts for one word). That is, in the Mapper you write out something like [key, value] for every word you encounter. A single run can only have one Mapper class and one Reducer class.
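Concretely (an illustrative trace, not output from the question's job): if the first map emits (w1, 0) and the second map emits (w1, 1), the shuffle groups them by key, so a single reduce call sees

reduce("w1", [0, 1])  ->  writes ("w1", 1)

which is exactly the merged result the question is after.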

In your case, it sounds like MapReduce may not be the right fit for your problem. Comparing one file against another is not necessarily the kind of problem that partitioning and parallelization naturally make more efficient. What you could do is partition the text file and send each text partition, together with the whole word 1 or -1 file, to every Mapper. The Reducers would then compute the sum/value for each word, as sketched below.
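A minimal sketch of that idea, assuming Hadoop 2.x and that the word 1 or -1 file is small enough to ship to every map task through the distributed cache; JoinMapper and the "weights" naming are illustrative, not from the answer:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical single mapper: the plain text file is the normal input, while
// the whole "word 1 or -1" file is loaded from the distributed cache in setup().
public class JoinMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final java.util.Map<String, Integer> weights = new HashMap<>();
    private final Text word = new Text();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumes the driver registered the file with job.addCacheFile(new URI(...));
        // on YARN the cached file is symlinked into the task's working directory.
        URI[] cacheFiles = context.getCacheFiles();
        String localName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(new File(localName)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length == 2) {
                    weights.put(parts[0], Integer.parseInt(parts[1]));
                }
            }
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            Integer weight = weights.get(token); // null when the word has no 1/-1 entry
            word.set(token);
            context.write(word, new IntWritable(weight == null ? 0 : weight));
        }
    }
}

The driver would then use this single mapper class with one input path, register the weights file with job.addCacheFile(new URI(args[1])), and keep the summing reducer unchanged.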

You could also post your Mapper and Reducer classes here.