Question

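I want to analyze text files with Hadoop MapReduce.

CSV files are easy to analyze because the columns can be separated by ','.

A plain text file, however, has no such column delimiter.

Here is the format of my text file: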

2015-8-02

error2014 blahblahblahblah

2015-8-02

blahblahbalh error2014

I would like the output to be:

date       contents   sum of errors

2015-8-02  error2014  2

I want to analyze the file this way. How should I write the MapReduce program?

Answer 1

Assuming your text file has the following format:

2015-8-02

error2014 blahblahblahblah

2015-8-02

blahblahbalh error2014

You can use NLineInputFormat.

With NLineInputFormat you can specify exactly how many lines each mapper receives.

In your case, you can feed 2 lines to each mapper.
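One point worth spelling out: NLineInputFormat controls how many lines go into each input split (and so into each mapper task), but map() is still called once per line, with the byte offset as the key. The following is only a plain-Java simulation of that grouping, not Hadoop code. The class name NLineSplitSketch and the method splitLines are made up for illustration, and the sketch assumes the file has no blank lines between records (blank lines would count as lines too).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mimics how N lines end up in each split.
final class NLineSplitSketch {

    /** Groups lines into consecutive chunks of at most n lines, one chunk per simulated mapper. */
    static List<List<String>> splitLines(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2015-8-02",
                "error2014 blahblahblahblah",
                "2015-8-02",
                "blahblahbalh error2014");
        List<List<String>> splits = splitLines(lines, 2);
        for (int i = 0; i < splits.size(); i++) {
            // Within one split, map() would still run once per line.
            for (String line : splits.get(i)) {
                System.out.println("mapper " + i + " -> map(\"" + line + "\")");
            }
        }
    }
}
```

Because the date and the message arrive in separate map() calls, a real mapper would have to remember the most recent date line in a field until the message line shows up.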

Edit:

Here is an example that uses NLineInputFormat:

Mapper class:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

Driver class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class Driver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            if (args.length != 2) {
                System.out.printf(
                        "Two parameters are required for Driver- <input dir> <output dir>\n");
                return -1;
            }

            // Job.getInstance replaces the deprecated new Job(conf) constructor (Hadoop 2+).
            Job job = Job.getInstance(getConf());
            job.setJobName("NLineInputFormat example");
            job.setJarByClass(Driver.class);

            // Each input split, and so each mapper, receives 2 lines.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path(args[0]));
            job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 2);

            LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(MapperNLine.class);
            job.setNumReduceTasks(0); // map-only job: no reducer runs

            boolean success = job.waitForCompletion(true);
            return success ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
            System.exit(exitCode);
        }
    }

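You can then extract the date and the error from the lines. Once you have them, emit them as a composite key (or as a single concatenated string key) with an IntWritable as the value, just as in the WordCount example, and then do the same basic summation in the reducer class that WordCount does. Note that this requires registering a reducer and dropping the job.setNumReduceTasks(0) call, since the driver above is a map-only job.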
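To make the extract-then-count idea concrete, here is a minimal plain-Java sketch of the logic, without any Hadoop dependencies. It assumes that date lines look like 2015-8-02, that errors look like the word "error" followed by digits, and that a tab-joined string is an acceptable composite key. The class name ErrorCountSketch and the methods toKeys and countKeys are made up for illustration. In a real job, toKeys would correspond to what the mapper emits and countKeys to the summation in the reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the extract-and-count logic, outside of Hadoop.
final class ErrorCountSketch {
    // Assumed formats: a date line such as 2015-8-02 and an error token such as error2014.
    private static final Pattern DATE = Pattern.compile("^\\d{4}-\\d{1,2}-\\d{1,2}$");
    private static final Pattern ERROR = Pattern.compile("error\\d+");

    /** "Map" step: emits one composite key (date TAB error) per error occurrence. */
    static List<String> toKeys(List<String> lines) {
        List<String> keys = new ArrayList<>();
        String currentDate = null; // most recent date line seen so far
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (DATE.matcher(line).matches()) {
                currentDate = line;
                continue;
            }
            if (currentDate == null) {
                continue; // message line with no preceding date: skipped
            }
            Matcher m = ERROR.matcher(line);
            while (m.find()) {
                keys.add(currentDate + "\t" + m.group());
            }
        }
        return keys;
    }

    /** "Reduce" step: adds 1 for every occurrence of a key, as WordCount does. */
    static Map<String, Integer> countKeys(List<String> keys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2015-8-02",
                "error2014 blahblahblahblah",
                "2015-8-02",
                "blahblahbalh error2014");
        // Prints: 2015-8-02<TAB>error2014<TAB>2
        countKeys(toKeys(lines)).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

In the Hadoop version, the mapper would write each key as a Text with an IntWritable of 1, and the reducer would sum the values per key.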

I hope this answers your question.
