Question

I have two files in different formats: one is SequenceFileInputFormat and the other is TextInputFormat. I know that Hadoop Streaming lets you specify two inputs, for example:

hadoop jar hadoop-streaming-2.8.0.jar \
  -input '/user/foo/dir1' -input '/user/foo/dir2' \
    (rest of the command)

But how do I specify a different -inputformat for each of these inputs?

I found that this can be done in Java with MultipleInputs, for example:

MultipleInputs.addInputPath(job, new Path(args[0]), SequenceFileInputFormat.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class);

Can I do something similar with Hadoop Streaming?

Answer 1

The Hadoop Streaming options include various settings for streaming jobs. The one you would probably use is:

-inputformat JavaClassName

The default is TextInputFormat.

I have only tested this with TextInputFormat, but I think it should look like this:

hadoop jar hadoop-streaming-2.8.0.jar \
  -input '/user/foo/dir1' -inputformat TextInputFormat \
  -input '/user/foo/dir2' -inputformat SequenceFileInputFormat \
    (rest of the command)

Here is a command that was tested and works (both inputs use TextInputFormat):

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0*.jar \
      -file mapperB.py -mapper mapperB.py -file reducerB.py -reducer reducerB.py \
      -input /tempfiles/big.txt -inputformat TextInputFormat \
      -input /tempfiles/t.txt -inputformat TextInputFormat \
      -output /tempfiles/output-X

Note: the -file option is deprecated.
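
As a minimal sketch, not run against a cluster, the tested command can be written with the generic -files option (a comma-separated list) in place of the deprecated -file. The function name run_streaming_job and the DRY_RUN switch are hypothetical, added only for illustration. Generic options such as -files have to come before the streaming options.

```shell
#!/bin/sh
# Sketch only: same job as the tested command, with -file replaced by -files.
# run_streaming_job and DRY_RUN are hypothetical names, not part of Hadoop.

run_streaming_job() {
  jar="$1"   # path to the hadoop-streaming jar

  # Build the argument list; -files (generic option) precedes streaming options.
  set -- hadoop jar "$jar" \
    -files mapperB.py,reducerB.py \
    -mapper mapperB.py -reducer reducerB.py \
    -input /tempfiles/big.txt -inputformat TextInputFormat \
    -input /tempfiles/t.txt -inputformat TextInputFormat \
    -output /tempfiles/output-X

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only print the command so it can be inspected first.
    echo "$*"
  else
    "$@"
  fi
}
```

Calling run_streaming_job with the path to the streaming jar prints the command. Setting DRY_RUN=0 submits the job instead.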
