Question

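I have an XML file that I want to process in a MapReduce job. I can process it when it is uncompressed, but it does not work once I compress it to bz2 and store it in HDFS. Do I need to change something, such as specifying the codec to use? I don't know how to do that, so any example would be great. I am using the XMLInputFormat from Mahout to read the uncompressed XML file. I compress the file with the bzip2 command and copy it to DFS with hadoop dfs -copyFromLocal. I am interested in reading and processing the content inside the <page></page> tags of the XML document. I am using the hadoop-1.2.1 distribution. I can see FileOutputFormat.setOutputCompressorClass, but there is nothing similar for FileInputFormat.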

Here is the Main class of my job.

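    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class Main extends Configured implements Tool {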

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new Main(), args);
            System.exit(res);
        }

        public int run(String[] args) throws Exception {

            if (args.length != 2) {
                System.err.println("Usage: hadoop jar XMLReaderMapRed "
                        + " [generic options] <in> <out>");
                System.out.println();
                ToolRunner.printGenericCommandUsage(System.err);
                return 1;
            }

            Job job = new Job(getConf(), "XMLTest");

            job.setInputFormatClass(MyXMLInputFormat.class);
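            // Specify the start and end tags that delimit the content.
            // Set them on the job's own configuration: Job copies the
            // Configuration it is given, so later changes to getConf() are not seen.
            job.getConfiguration().set(MyXMLInputFormat.START_TAG_KEY, "<page>");
            job.getConfiguration().set(MyXMLInputFormat.END_TAG_KEY, "</page>");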

            job.setJarByClass(getClass());
            job.setMapperClass(XMLReaderMapper.class);
            job.setReducerClass(XmlReaderReducer.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            boolean success = job.waitForCompletion(true);
            return success ? 0 : 1;
        }
    }

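Edit: Hadoop: The Definitive Guide by Tom White says, "If your input files are compressed, they will be automatically decompressed as they are read by MapReduce, using the filename extension to determine the codec to use." So the files should be decompressed automatically, but then why are empty files created in the output directory?

Thanks!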

Answer 1

Check your core-site.xml configuration file and add the BZip2 codec class if it is not already listed. Here is an example:

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

Edit:

After adding the codecs, run the following steps to confirm that the setup works (your own code may still not):

hadoop fs -mkdir /tmp/wordcount/
echo "three one three three seven" >> /tmp/words
bzip2 -z /tmp/words
hadoop fs -put /tmp/words.bz2 /tmp/wordcount/
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /tmp/wordcount/ /tmp/wordcount_out/
hadoop fs -text /tmp/wordcount_out/part*
# you should see the following three lines:
one     1
seven   1
three   3
# clean up
# these commands may differ in your case
hadoop fs -rmr /tmp/wordcount_out/
hadoop fs -rmr /tmp/wordcount/

Answer 2

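In your TextInputFormat implementation, you may have overridden createRecordReader and returned a custom RecordReader<KEYIN, VALUEIN> implementation that does not take codecs into account. The default implementation returns a LineRecordReader, which handles codecs correctly. You can find a reference implementation here, and the relevant changes need to be made here.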
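To illustrate why a reader that ignores the codec can produce empty output, here is a minimal, JDK-only sketch. It is not the Mahout XmlInputFormat or the asker's MyXMLInputFormat: `TagScanDemo`, `extractRecords`, and `gzip` are hypothetical names, and gzip stands in for bzip2 only because the JDK ships no bzip2 codec. The idea is that scanning the compressed bytes for `<page>` finds nothing, while scanning the decompressed stream finds every record. In Hadoop, the corresponding step would typically be to look up the codec with `CompressionCodecFactory.getCodec(path)` and, when it is not null, wrap the opened file with `codec.createInputStream(...)` before scanning.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TagScanDemo {

    // Hypothetical helper: returns every startTag...endTag span found in the stream.
    static List<String> extractRecords(InputStream in, String startTag, String endTag)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        // ISO-8859-1 maps each byte to one char, so binary input cannot break decoding.
        String data = new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);

        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = data.indexOf(startTag, from);
            if (start < 0) break;
            int end = data.indexOf(endTag, start + startTag.length());
            if (end < 0) break;
            end += endTag.length();
            records.add(data.substring(start, end));
            from = end;
        }
        return records;
    }

    // Compresses a byte array with gzip (a stand-in for bzip2 in this sketch).
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<doc><page>one</page><page>two</page><page>three</page></doc>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] compressed = gzip(xml);

        // Scanning the compressed bytes directly: the tags are not visible.
        List<String> raw = extractRecords(
                new ByteArrayInputStream(compressed), "<page>", "</page>");

        // Scanning after wrapping the stream in a decompressor: the tags are found.
        List<String> decoded = extractRecords(
                new GZIPInputStream(new ByteArrayInputStream(compressed)), "<page>", "</page>");

        System.out.println("records from compressed bytes:    " + raw.size());
        System.out.println("records from decompressed stream: " + decoded.size());
    }
}
```

If your custom input format shows the same pattern (records found for the plain file, none for the .bz2 file), its record reader is probably reading the raw file stream rather than the codec's input stream.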

Unable to read a bz2 compressed file in a Hadoop job
