Hadoop Python job on snappy files produces zero-size output

Date: 2015-11-11 12:03:12

Tags: hadoop hadoop-streaming mrjob

When I run wordcount.py (the Python mrjob example from http://mrjob.readthedocs.org/en/latest/guides/quickstart.html#writing-your-first-job) with Hadoop Streaming on a plain text file, it produces output, but when I run it against a .snappy file I get zero-size output.
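For reference, the quickstart job linked above is essentially the following (the word_count2.py used later is presumably a close variant):

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        # called once per input line; emit per-line counts
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # sum the per-line counts for each key
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()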

Options tried:

[testgen word_count]# cat mrjob.conf 
runners:
  hadoop: # this will work for both hadoop and emr
    jobconf:
      mapreduce.task.timeout: 3600000
      #mapreduce.max.split.size: 20971520
      #mapreduce.input.fileinputformat.split.maxsize: 102400
      #mapreduce.map.memory.mb: 8192
      mapred.map.child.java.opts: -Xmx4294967296
      mapred.child.java.opts: -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/

      java.library.path: /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/

      # "true" must be a string argument, not a boolean! (#323)
      #mapreduce.output.compress: "true"
      #mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec

[testgen word_count]# 

Command:

[testgen word_count]# python word_count2.py -r hadoop hdfs:///input.snappy --conf mrjob.conf 
creating tmp directory /tmp/word_count2.root.20151111.113113.369549
writing wrapper script to /tmp/word_count2.root.20151111.113113.369549/setup-wrapper.sh
Using Hadoop version 2.5.0
Copying local files into hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/files/

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

Detected hadoop configuration property names that do not match hadoop version 2.5.0:
They have been translated as follows
 mapred.map.child.java.opts: mapreduce.map.java.opts
HADOOP: packageJobJar: [/tmp/hadoop-root/hadoop-unjar3623089386341942955/] [] /tmp/streamjob3671127555730955887.jar tmpDir=null
HADOOP: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
HADOOP: Running job: job_201511021537_70340
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH//bin/hadoop job  -Dmapred.job.tracker=logicaljt -kill job_201511021537_70340
HADOOP: Tracking URL: http://xxxxx_70340
HADOOP:  map 0%  reduce 0%
HADOOP:  map 100%  reduce 0%
HADOOP:  map 100%  reduce 11%
HADOOP:  map 100%  reduce 97%
HADOOP:  map 100%  reduce 100%
HADOOP: Job complete: job_201511021537_70340
HADOOP: Output: hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output
Counters from step 1:
  (no counters found)
Streaming final output from hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output

removing tmp directory /tmp/word_count2.root.20151111.113113.369549
deleting hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549 from HDFS
[testgen word_count]# 

No errors were thrown, the job reported success, and I have verified the job configuration in the job stats.

Is there any other way to troubleshoot this?
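One quick check (standard Hadoop CLI; assumes the cluster's native Snappy libraries are installed) is whether HDFS itself can decode the file, since hadoop fs -text decompresses through the configured codecs; garbage or an error here would point at the input rather than the job:

hadoop fs -text hdfs:///input.snappy | head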

2 Answers:

Answer 0 (score: 1):

I don't think you are using the right options.

In your mrjob.conf file:

  1. mapreduce.output.compress: "true" means that you want the output to be compressed
  2. mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec means that the compression uses the Snappy codec
  3. You apparently expect the mapper to read your compressed input correctly. Unfortunately, it does not work like that. If you really want to feed the job compressed data, you can look into SequenceFile. Another, simpler solution is to feed the job text files only.

    As for configuring the input side, for example: mapreduce.input.compression.codec: org.apache.hadoop.io.compress.SnappyCodec

    [Edit: you should also remove the # at the beginning of the lines where those options are defined, otherwise they will be ignored; see the sketch below.]
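Putting these suggestions together, a minimal mrjob.conf sketch (same paths as the original config; the output-compression lines are uncommented per the edit note, and whether the Snappy input is actually readable still depends on point 3):

runners:
  hadoop:
    jobconf:
      mapreduce.task.timeout: 3600000
      mapred.child.java.opts: -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/
      # "true" must be a string argument, not a boolean! (#323)
      mapreduce.output.compress: "true"
      mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec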

Answer 1 (score: 0):

Thanks for your input Yann, but in the end it was inserting the line below into the job script that solved the problem.

HADOOP_INPUT_FORMAT='<org.hadoop.snappy.codec>'
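For context, mrjob expects HADOOP_INPUT_FORMAT as a class attribute on the job class, which it passes to Hadoop Streaming as -inputformat. A sketch of that pattern (the value is kept from the answer, minus the angle brackets, which read as placeholder delimiters; normally this would name a real InputFormat class available on the cluster):

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):
    # passed to Hadoop Streaming as -inputformat; value kept
    # verbatim from the answer above (placeholder brackets dropped)
    HADOOP_INPUT_FORMAT = 'org.hadoop.snappy.codec'

    def mapper(self, _, line):
        yield "words", len(line.split())

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()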