Extracting counts from a log file with MapReduce

Date: 2015-03-12 09:16:19

Tags: url hadoop logging text mapreduce

I am trying to solve the following task with Hadoop MapReduce. I have a log file in which each line contains an IP address followed by the URL that IP opened. It looks like this:

192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
192.168.198.92 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.facebook.com
192.168.198.92 www.indiabix.com
192.168.72.177 www.indiabix.com
192.168.72.224 www.google.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.facebook.com
192.168.198.92 www.gmail.com
192.168.72.177 www.facebook.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.72.224 www.yahoo.com
192.168.72.177 www.google.com
192.168.72.177 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com

Now, I need to organize the results from this file so that it lists each distinct IP address together with the URLs it opened and the number of times it opened each one.

For example, if 192.168.72.224 opened www.yahoo.com 15 times over the whole log file, the output must contain:

192.168.72.224 www.yahoo.com 15

This should be done for every IP in the file, and the final output should look like this:

192.168.72.224 www.yahoo.com 15
               www.m4maths.com 11
192.168.72.177 www.yahoo.com 6
               www.gmail.com 19
....
...
..
.

The code I have tried is:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        // Emits every token on the line (the IP as well as the URL) as a
        // separate key with a count of 1, so IPs and URLs get counted together.
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

I know this code is seriously flawed; please suggest how I can move forward.

Thanks.

2 Answers:

Answer 0 (score: 1)

I would propose this design:

  1. The mapper takes one line from the file and emits the IP as the key and a (website, 1) pair as the value.
  2. The combiner and the reducer receive an IP as the key and a series of (website, count) pairs, aggregate them by website (for example with a HashMap), and emit the IP, website, and count.
  3. Implementing this requires a custom Writable to hold the pair; a minimal sketch follows right after this list.
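
One way to implement the pair mentioned in step 3 is sketched below. The class name TextIntPair and its accessors are my own illustration, not something prescribed by the answer:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Hypothetical (website, count) pair for use as a map/combine output value.
    public class TextIntPair implements Writable {

        private final Text website = new Text();
        private final IntWritable count = new IntWritable();

        public void set(String site, int cnt) {
            website.set(site);
            count.set(cnt);
        }

        public Text getWebsite() { return website; }
        public IntWritable getCount() { return count; }

        @Override
        public void write(DataOutput out) throws IOException {
            // Serialize both fields in a fixed order...
            website.write(out);
            count.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // ...and deserialize them in exactly the same order.
            website.readFields(in);
            count.readFields(in);
        }
    }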

    Personally, I would do this in Spark unless you care a lot about performance. With PySpark it is as simple as this (note that the snippet uses Python 2 syntax; the tuple-unpacking lambda and the print statement do not work in Python 3):

    rdd = sc.textFile('/sparkdemo/log.txt')
    # Count every (ip, url) pair across the whole file.
    counts = rdd.map(lambda line: line.split()).map(lambda line: ((line[0], line[1]), 1)).reduceByKey(lambda x, y: x+y)
    # Re-key by IP and collect the (url, count) pairs for each IP.
    result = counts.map(lambda ((ip, url), cnt): (ip, (url, cnt))).groupByKey().collect()
    for x in result:
        print 'IP: %s' % x[0]
        for w in x[1]:
            print '    website: %s count: %d' % (w[0], w[1])
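
    As a side note on the design, reduceByKey aggregates the per-(ip, url) counts within each partition before the shuffle, much like a combiner in MapReduce, so the subsequent groupByKey only moves one (url, count) pair per distinct URL instead of one record per log line.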
    

    The output for your example would be:

    IP: 192.168.72.224
        website: www.facebook.com count: 2
        website: www.m4maths.com count: 2
        website: www.google.com count: 5
        website: www.gmail.com count: 4
        website: www.indiabix.com count: 8
        website: www.yahoo.com count: 3
    IP: 192.168.72.177
        website: www.yahoo.com count: 14
        website: www.google.com count: 3
        website: www.facebook.com count: 3
        website: www.m4maths.com count: 3
        website: www.indiabix.com count: 1
    IP: 192.168.198.92
        website: www.facebook.com count: 4
        website: www.m4maths.com count: 3
        website: www.yahoo.com count: 3
        website: www.askubuntu.com count: 2
        website: www.indiabix.com count: 1
        website: www.google.com count: 5
        website: www.gmail.com count: 1
    

Answer 1 (score: 1)

I have written the same logic in Java:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UrlHitMapper extends Mapper<Object, Text, Text, Text> {

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        StringTokenizer st = new StringTokenizer(value.toString());

        // Emit (IP, URL). Java evaluates arguments left to right, so the first
        // token (the IP) becomes the key and the second (the URL) the value.
        // Requiring two tokens guards against short or empty lines.
        if (st.countTokens() >= 2)
            context.write(new Text(st.nextToken()), new Text(st.nextToken()));
    }
}

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map.Entry;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UrlHitReducer extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        // Tally how often each URL appears for this IP.
        HashMap<String, Integer> urlCount = new HashMap<>();

        Iterator<Text> it = values.iterator();
        while (it.hasNext()) {
            String url = it.next().toString();

            if (urlCount.get(url) == null)
                urlCount.put(url, 1);
            else
                urlCount.put(url, urlCount.get(url) + 1);
        }

        // Emit one "URL    count" line per URL seen for this IP.
        for (Entry<String, Integer> k : urlCount.entrySet())
            context.write(key, new Text(k.getKey() + "    " + k.getValue()));
    }
}

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class UrlHitCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new UrlHitCount(), args));
    }

    public int run(String[] arg0) throws Exception {

        Job job = Job.getInstance(getConf());
        job.setJobName("url-hit-count");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(UrlHitMapper.class);
        job.setReducerClass(UrlHitReducer.class);

        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("input/urls"));
        // Timestamp the output directory so reruns do not collide.
        FileOutputFormat.setOutputPath(job, new Path("url_output" + System.currentTimeMillis()));

        // setJarByClass must point at a class contained in this job's jar.
        job.setJarByClass(UrlHitCount.class);

        // Wait for completion and surface success/failure as the exit code.
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
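
To try this out, package the three classes into a jar and submit it with the hadoop CLI. The jar name below is only an example, and the input is expected at input/urls in HDFS, as hardcoded above:

    hadoop jar url-hit-count.jar UrlHitCount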