Hadoop - Find the total IP hits and the unique IP addresses, then find the average (total IP hits / number of unique IPs)

Date: 2015-02-21 13:29:22

Tags: java hadoop mapreduce average

I am trying to learn Hadoop. I have written a MapReduce job that finds the total number of IP hits and the unique IP addresses, and then computes the average (total IP hits / number of unique IPs).

However, I get the output listing every IP along with its hit count, but I am not able to get the average.

Code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public final class IPAddress {
    private final static IntWritable ONE = new IntWritable(1);

    static int totalHits = 0, uniqueIP = 0;
    public final static void main(final String[] args) throws Exception 
    {
        final Configuration conf = new Configuration();

        final Job job = new Job(conf, "IPAddress");
        job.setJarByClass(IPAddress.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(IPMap.class);
        job.setCombinerClass(IPReduce.class);
        job.setReducerClass(IPReduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
        int average = totalHits/uniqueIP;
        System.out.print("Average is :"+average+"\n");
    }

    public static final class IPMap extends Mapper<LongWritable, Text, Text, IntWritable> 
    {
        private final Text mapKey = new Text();

        public final void map(final LongWritable key, final Text value, final Context context) throws IOException, InterruptedException 
        {
            final String line = value.toString();
            final String[] data = line.trim().split("- -");
            if (data.length > 1) 
            {
                final String ipAddress = data[0];
                mapKey.set(ipAddress);
                context.write(mapKey, ONE);
            }
        }
    }

    public static final class IPReduce extends Reducer<Text, IntWritable, Text, IntWritable> 
    {

        public final void reduce(final Text key, final Iterable<IntWritable> values, final Context context) throws IOException, InterruptedException 
        {
            int count = 0, sum = 0, distinctIpCount=0;
            for (final IntWritable val : values) 
            {
                count += val.get();
                sum += count;
                distinctIpCount++;
            }
            totalHits = count;
            uniqueIP = distinctIpCount;
            context.write(key, new IntWritable(count));
        }
    }
}

1 Answer:

Answer 0 (score: 1)

An important point about how MapReduce jobs execute: even though you provide all of the code in a single class, the MapReduce framework extracts the mapper and reducer classes you supply and ships them to the worker nodes for execution, while the main() method runs in the local JVM from which you submitted the job. This means the mapper and reducer methods cannot see any variables you define outside of the mapper and reducer classes, so your static totalHits and uniqueIP fields are never updated in the JVM that computes the average.

For your use case specifically, if you want to compute the average hit count across all IP addresses, you can only use a single reducer when invoking the job (-D mapred.reduce.tasks=1). That way you can define totalHits and uniqueIP inside IPReduce, and every reduce() call will see the same instances of those variables. You can then calculate the average in the reducer's cleanup() method, which runs after all of the reduce() calls have completed.
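
A minimal sketch of what that reducer could look like (this is not the original poster's code; it assumes exactly one reducer, which can also be forced from the driver with job.setNumReduceTasks(1)):

    public static final class IPReduce extends Reducer<Text, IntWritable, Text, IntWritable> 
    {
        // Instance fields: with exactly one reducer, these accumulate across all keys.
        private int totalHits = 0;
        private int uniqueIP = 0;

        @Override
        public final void reduce(final Text key, final Iterable<IntWritable> values, final Context context) throws IOException, InterruptedException 
        {
            int count = 0;
            for (final IntWritable val : values) 
            {
                count += val.get();    // hits for this particular IP
            }
            totalHits += count;        // running total over all IPs
            uniqueIP++;                // each key seen here is one distinct IP
            context.write(key, new IntWritable(count));
        }

        @Override
        protected void cleanup(final Context context) throws IOException, InterruptedException 
        {
            // Called once after all reduce() calls have finished; a safe place to emit the average.
            if (uniqueIP > 0) 
            {
                context.write(new Text("Average"), new IntWritable(totalHits / uniqueIP));
            }
        }
    }

Note that the original driver also registers IPReduce as the combiner; with this approach the combiner should be removed (or be a separate class), since a combiner runs on the map side and would otherwise skew the counts and emit its own "Average" records.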

You will not easily be able to send the result back to the main program to print it to the screen, but you can either emit it as part of the job output (writing it through the same Context object), or, if you want the per-IP counts to remain the main job output, use the HDFS API to write the average to a separate file in HDFS.
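
As a rough illustration of the second option, the average could be written to HDFS from cleanup() instead of being emitted through the Context. The output path below is purely hypothetical, and the snippet assumes the same totalHits/uniqueIP fields as in the sketch above (it also needs imports for org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.FSDataOutputStream):

        @Override
        protected void cleanup(final Context context) throws IOException, InterruptedException 
        {
            // Write the average to a side file in HDFS, leaving the per-IP counts
            // as the normal job output. The path is an arbitrary example.
            final FileSystem fs = FileSystem.get(context.getConfiguration());
            final Path averagePath = new Path("/user/hadoop/ip-average.txt");
            try (FSDataOutputStream out = fs.create(averagePath, true)) 
            {
                final int average = (uniqueIP > 0) ? totalHits / uniqueIP : 0;
                out.writeBytes("Average is: " + average + "\n");
            }
        }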