Hadoop map-reducer not writing any output

Date: 2016-12-07 04:26:04

Tags: java csv hadoop mapreduce hadoop2

I am working on a three-node Hadoop MapReduce problem that takes a 200,000-line input.csv file with dates and point values as the headers (gist of 25 sample rows: https://gist.githubusercontent.com/PatMulvihill/63effd90411efe858330b54a4111fadb/raw/4033695ba5ca2f439cfd1512358425643807d83b/input.csv). The program should find every point value that is not one of 200, 400, 600, 800, 1000, 1200, 1600, or 2000. That point value should become the value, and the key should be the year taken from the date that precedes it. For example, given the data

2000-05-25, 400
2001-10-12, 650
2001-04-09, 700

the key-value pairs that should be sent to the reducer are <2001, 650> and <2001, 700>. The reducer should then take the average of all values for each year (here, <2001, 675>) and write those key-value pairs to the HDFS /out path I specify. The program compiles fine but never actually writes any output. I would like to know why, and what I can do to fix it. The full code is below:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JeopardyMR {

public static class SplitterMapper extends Mapper <Object, Text, Text, IntWritable> {

    public void map (Object key, Text value, Context context) throws IOException, InterruptedException {
        // Convert the CSVString (of type Text) to a string
        String CSVString = value.toString();
        // Split the string at each comma, creating an ArrayList with the different attributes in each index.
        // Sometimes the questions will be split into multiple elements because they contain commas, but for the
        // way we will be parsing the CSV's, it doesn't matter.
        List<String> items = Arrays.asList(CSVString.split("\\s*,\\s*"));
        // Loop through all the elements in the CSV
        // Start i at 3 to ensure that you do not parse a point value that has a year absent from the data set.
        // We can end the loop at items.size() w/o truncating the last 3 items because if we have a point value, we know
        // that the corresponding year is in the items before it, not after it.
        // We will miss 1 or 2 data points because of this, but it shouldn't matter too much because of the magnitude of our data set
        // and the fact that a value has a low probability of actually being a daily double wager.
        for (int i = 3; i < items.size(); i++) {
            // We want a String version of the item that is being evaluated so that we can see if it matches the regex
            String item = items.get(i);
            if (item.matches("^\\d{4}\\-(0?[1-9]|1[012])\\-(0?[1-9]|[12][0-9]|3[01])$")) {
                // Make sure that we don't get an out of bounds error when trying to access the next item
                if (i + 1 >= items.size()) {
                    break;
                } else {
                    // the wagerStr should always be the item after a valid air date
                    String wagerStr = items.get(i + 1);
                    int wager = Integer.parseInt(wagerStr);
                    // if a wager isn't the following values, assume that is a daily double wager
                    if (wager != 200 && wager != 400 && wager != 600 && wager != 800 && wager != 1000 && wager != 1200 && wager != 1600 && wager != 2000) {
                        // if we know that a point value of a question is in fact a daily double wager, find the year that the daily double happened
                        // the year will always be the first 4 digits of a valid date formatted YYYY-MM-DD
                        char[] airDateChars = item.toCharArray();
                        String year = "" + airDateChars[0] + airDateChars[1] + airDateChars[2] + airDateChars[3];

                        // output the follow key-value pair: <year, wager>
                        context.write(new Text(year), new IntWritable(wager));
                    }
                }

            }
        }
    }
}

public static class IntSumReducer extends Reducer <Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();
    public void reduce (Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0, count = 0;
        for (IntWritable val : values) {
            sum += val.get();
            count++;
        }
        int avg = sum / count;
        result.set(avg);
        context.write(key, result);
    }
}

public static void main (String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "jeopardy daily double wagers by year");
    job.setJarByClass(JeopardyMR.class);
    job.setMapperClass(SplitterMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

The terminal output from the successful compilation can be found here: https://gist.github.com/PatMulvihill/40b3207fe8af8de0b91afde61305b187. I am new to Hadoop MapReduce and may well have made a very silly mistake. I based this code on the tutorial here: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html. Please let me know if I have left out any useful information. Any help would be appreciated. Thank you.

2 Answers:

Answer 0 (score: 1)

I checked, and I believe items.size() is two. As you know, the map input is the file taken one line at a time, and the map task runs the map function once per line. Once a line is split at the commas, the items list has only two elements, but your loop body only runs when items.size() is greater than 3 (because i starts at 3), so it never executes. You can check the map output bytes/records counters to see whether any data was written at all (see the sketch after the code below).

EDIT: Replace your map code with the following:

public void map (Object key, Text value, Context context) throws IOException, InterruptedException {
    // Each input value is one line of the CSV.
    String CSVString = value.toString();
    // Split the line at the comma, trimming surrounding whitespace.
    String[] yearsValue = CSVString.split("\\s*,\\s*");
    // A well-formed line has exactly two fields: the air date and the point value.
    if (yearsValue.length == 2) {
        int wager = Integer.parseInt(yearsValue[1]);
        // If the wager is not one of the standard values, treat it as a daily double wager.
        if (wager != 200 && wager != 400 && wager != 600 && wager != 800 && wager != 1000 && wager != 1200 && wager != 1600 && wager != 2000) {
            // The year is the first four characters of the YYYY-MM-DD air date.
            char[] airDateChars = yearsValue[0].toCharArray();
            String year = "" + airDateChars[0] + airDateChars[1] + airDateChars[2] + airDateChars[3];
            // Emit <year, wager>.
            context.write(new Text(year), new IntWritable(wager));
        }
    } else {
        // Log malformed lines so they can be inspected in the task logs.
        System.out.println(CSVString);
    }
}
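
To follow the counter suggestion above, one option is to read the built-in job counters in the driver once the job has finished. The sketch below is only illustrative: the class and method names (CounterCheck, printRecordCounters) are made up for this example, and it assumes the Job object from the question's main method together with the Hadoop 2 TaskCounter enum.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterCheck {

    // Call this after job.waitForCompletion(true) returns, before System.exit().
    public static void printRecordCounters(Job job) throws Exception {
        long mapOut = job.getCounters().findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
        long reduceOut = job.getCounters().findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();
        System.out.println("Map output records:    " + mapOut);
        System.out.println("Reduce output records: " + reduceOut);
        // A map-output count of 0 means the mapper never emitted anything,
        // which is exactly what the items.size() == 2 diagnosis above predicts.
    }
}

In the question's driver that means capturing the boolean result of waitForCompletion, calling the helper, and only then calling System.exit.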

Answer 1 (score: 0)

I actually got around this by converting the .csv file to a .txt file. That is not a real solution to the problem, but it is what got my code working, and now I can move on to understanding why it was a problem in the first place. It might also help someone in the future!
