Strange output when reading a .tar.gz file in MapReduce

Date: 2014-05-08 07:57:25

Tags: hadoop mapreduce

Please go easy on me, as I'm new to Hadoop and MapReduce.

I have a .tar.gz file that I'm trying to read with MapReduce by writing a custom InputFormat that uses CompressionCodecFactory.

I read some documentation on the Internet saying that CompressionCodecFactory can be used to read a .tar.gz file, so I implemented that in my code.

The output I get after running the code is complete garbage.

My input file looks like this:

"MAY 2013          KOTZEBUE, AK"
"RALPH WIEN MEMORIAL AIRPORT (PAOT)"
"Lat:66° 52'N   Long: 162° 37'W   Elev (Ground) 30 Feet"
"Time Zone : ALASKA           WBAN: 26616    ISSN#: 0197-9833"
01,21,0,11,-11,3,11,54,0," ",4,  ,0.0,0.00,30.06,30.09,10.2,36,10.0,25,360,22,360,01
02,25,3,14,-9,5,12,51,0," ",4,  ,0.0,0.00,30.09,30.11,6.1,34,7.7,16,010,14,360,02
03,21,1,11,-12,7,11,54,0," ",4,  ,0.0,0.00,30.14,30.15,5.0,28,6.0,17,270,16,270,03
04,20,8,14,-10,11,13,51,0,"SN BR",4,  ,.001,.0001,30.09,30.11,8.6,26,9.2,20,280,15,280,04
05,29,19,24,-1,21,23,41,0,"SN BR",5,  ,0.6,0.06,30.11,30.14,8.1,20,8.5,22,240,20,240,05
06,27,19,23,-3,21,23,42,0,"SN BR",4,  ,0.1,0.01,30.14,30.15,8.7,19,9.4,18,200,15,200,06

The output I get is strange:

��@(���]�OX}�s���{Fw8OP��@ig@���e�1L'�����sAm�
��@���Q�eW�t�Ruk�@��AAB.2P�V��    \L}��+����.֏9U]N �)(���d��i(��%F�S<�ҫ  ���EN��v�7�Y�%U�>��<�p���`]ݹ�@�#����9Dˬ��M�X2�'��\R��\1-    ���V\K1�c_P▒W¨P[Ö␤ÍãÏ2¨▒;O

Here is the custom InputFormat and RecordReader code:

InputFormat:

public class SZ_inptfrmtr extends FileInputFormat<Text, Text>
{

@Override
public RecordReader<Text, Text> getRecordReader(InputSplit split,
        JobConf job_run, Reporter reporter) throws IOException {
    // TODO Auto-generated method stub
    return new SZ_recordreader(job_run, (FileSplit)split);
}

}

RecordReader:

public class SZ_recordreader implements RecordReader<Text, Text>
{
FileSplit split;
JobConf job_run;
boolean processed=false;

CompressionCodecFactory compressioncodec=null;   // A factory that will find the correct codec(.file) for a given filename.
public SZ_recordreader(JobConf job_run, FileSplit split)
{
    this.split=split;
    this.job_run=job_run;
}

@Override
public void close() throws IOException {
    // TODO Auto-generated method stub

}

@Override
public Text createKey() {
    // TODO Auto-generated method stub
    return new Text();
}

@Override
public Text createValue() {
    // TODO Auto-generated method stub
    return new Text();
}

@Override
public long getPos() throws IOException {
    // TODO Auto-generated method stub
    return processed ? split.getLength() : 0;
}

@Override
public float getProgress() throws IOException {
    // TODO Auto-generated method stub
    return processed ? 1.0f : 0.0f;
}

@Override
public boolean next(Text key, Text value) throws IOException {
    // TODO Auto-generated method stub
    FSDataInputStream in=null;
    if (!processed)
    {
        byte [] bytestream= new byte [(int) split.getLength()];
        Path path=split.getPath();
        compressioncodec=new CompressionCodecFactory(job_run);

        CompressionCodec code = compressioncodec.getCodec(path);  
        // compressioncodec will find the correct codec by visiting the path of the file and store the result in code
        System.out.println(code);

        FileSystem fs= path.getFileSystem(job_run);

        try
        {
            in =fs.open(path);
            IOUtils.readFully(in, bytestream, 0, bytestream.length);
            System.out.println("the input is " +in+ in.toString());
            key.set(path.getName());
            value.set(bytestream, 0, bytestream.length);
        }
        finally
        {
            IOUtils.closeStream(in);
        }

        processed=true;

        return true;


    }
    return false;
}

}

Could someone please point out the bug?

2 Answers:

Answer 0 (score: 3)

There is a codec for .gz, but there is no codec for .tar.

Your .tar.gz is being decompressed into a .tar, but that is still a tarball, not something the Hadoop framework can understand.
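
A rough sketch of one way the RecordReader's next() could deal with this. This is an assumption-laden example, not the asker's code: it presumes the Apache Commons Compress library is on the classpath (TarArchiveInputStream and TarArchiveEntry are not part of Hadoop), and it still emits the whole archive as a single key/value pair, as the original attempts to do:

import java.io.InputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

@Override
public boolean next(Text key, Text value) throws IOException {
    if (processed) {
        return false;
    }
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(job_run);

    compressioncodec = new CompressionCodecFactory(job_run);
    CompressionCodec codec = compressioncodec.getCodec(path);  // resolves GzipCodec from the .gz suffix

    FSDataInputStream raw = fs.open(path);
    // Undo the outer gzip layer; without this, the record value is raw compressed bytes.
    InputStream in = (codec != null) ? codec.createInputStream(raw) : raw;

    // Hadoop has no codec for the inner tar layer, so walk its entries explicitly.
    TarArchiveInputStream tar = new TarArchiveInputStream(in);
    StringBuilder contents = new StringBuilder();
    try {
        TarArchiveEntry entry;
        while ((entry = tar.getNextTarEntry()) != null) {
            if (!entry.isFile()) {
                continue;
            }
            byte[] buf = new byte[(int) entry.getSize()];
            IOUtils.readFully(tar, buf, 0, buf.length);
            contents.append(new String(buf, "UTF-8"));
        }
    } finally {
        IOUtils.closeStream(tar);
    }

    key.set(path.getName());
    value.set(contents.toString());
    processed = true;
    return true;
}

The key point is that the gzip codec only removes the outer compression layer; the tar entries inside still have to be walked by hand.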

Answer 1 (score: 0)

Your code may be getting stuck in the communication between the mapper and reducer classes. To use compressed files in MapReduce, you need to set a few configuration options for your job. These options must be set in the driver class:

conf.setBoolean("mapred.output.compress", true);//Compress The Reducer Out put
conf.setBoolean("mapred.compress.map.output", true);//Compress The Mapper Output
conf.setClass("mapred.output.compression.codec",
codecClass,
CompressionCodec.class);//Compression codec for Compresing mapper output

The only difference between a MapReduce job with uncompressed I/O and one with compressed I/O is these three annotated lines.
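
For context, a minimal sketch of an old-API (mapred) driver with those lines in place; the class names MyDriver/MyMapper/MyReducer, the GzipCodec choice, and the args-based paths are placeholders, not taken from the question:

// Placeholder driver showing where the compression settings belong.
JobConf conf = new JobConf(MyDriver.class);
conf.setJobName("compressed-io-example");

conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);

conf.setBoolean("mapred.output.compress", true);       // compress the reducer output
conf.setBoolean("mapred.compress.map.output", true);   // compress the intermediate mapper output
conf.setClass("mapred.output.compression.codec",
        GzipCodec.class,                                // any available CompressionCodec subclass
        CompressionCodec.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);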

"I read some document over Internet that CompressionCodecFactory can be used to read a .tar.gz file. Hence I implemented that in my code."

Even the compression codecs can do better: there are many codecs available for this purpose, the most common being LzopCodec and SnappyCodec for potentially large data. You can find the Git source for LzopCodec here: https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/compression/lzo/LzopCodec.java
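
For instance, assuming the Snappy native libraries are installed on the cluster, plugging SnappyCodec in for the intermediate map output might look like this:

conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
        SnappyCodec.class,                  // org.apache.hadoop.io.compress.SnappyCodec
        CompressionCodec.class);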