Question

I need to process and manipulate many images in a Hadoop job. The input will arrive over the network, with the slow downloads handled by a MultiThreadedMapper.

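But what is the best approach for the reduce output? I think I should write the raw binary image data to a SequenceFile, transfer those files to their final home, and then write a small app to extract the individual images from the SequenceFile into separate JPG and GIF files.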

Or is there a better option?

Answer 1

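If you feel up to it (or can find an implementation with a bit of Googling), you could write a FileOutputFormat that wraps the FSDataOutputStream in a ZipOutputStream, giving you one Zip file per reducer (and saving you the effort of writing a SequenceFile extraction program).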

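Don't be daunted by writing your own OutputFormat; it really isn't that difficult (and it is a lot easier than writing a custom InputFormat, which has to worry about splits). In fact, here is a starting point; you only need to implement the write method: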

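import java.io.IOException;
import java.util.zip.ZipOutputStream;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Key: Text (path of the file in the output zip)
// Value: BytesWritable - binary content of the image to save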
public class ZipFileOutputFormat extends FileOutputFormat<Text, BytesWritable> {
    @Override
    public RecordWriter<Text, BytesWritable> getRecordWriter(
            TaskAttemptContext job) throws IOException, InterruptedException {
        Path file = getDefaultWorkFile(job, ".zip");

        FileSystem fs = file.getFileSystem(job.getConfiguration());

        return new ZipRecordWriter(fs.create(file, false));
    }

    public static class ZipRecordWriter extends
            RecordWriter<Text, BytesWritable> {
        protected ZipOutputStream zos;

        public ZipRecordWriter(FSDataOutputStream os) {
            zos = new ZipOutputStream(os);
        }

        @Override
        public void write(Text key, BytesWritable value) throws IOException,
                InterruptedException {
            // TODO: create new ZipEntry & add to the ZipOutputStream (zos)
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException,
                InterruptedException {
            zos.close();
        }
    }
}
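
For the TODO in write, the core steps are probably just: open a ZipEntry named after the key, write the value's bytes, and close the entry. Below is a minimal sketch of that logic using only java.util.zip, so it can be run without a cluster. The class name ZipEntrySketch and the helpers writeEntry and readEntry are hypothetical names made up for this illustration, not part of Hadoop. One thing to watch for: BytesWritable.getBytes() returns the backing array, which can be longer than getLength(), so only the first getLength() bytes should be written. The sketch mimics that with a padded array and an explicit length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical stand-alone demo of the ZipEntry handling; not Hadoop code.
class ZipEntrySketch {

    // Adds one entry to the archive. Only the first `length` bytes of
    // `buffer` are written, the way a RecordWriter would treat
    // value.getBytes() together with value.getLength().
    public static void writeEntry(ZipOutputStream zos, String name,
            byte[] buffer, int length) throws IOException {
        zos.putNextEntry(new ZipEntry(name));
        zos.write(buffer, 0, length);
        zos.closeEntry();
    }

    // Reads a single named entry back out of an in-memory zip,
    // or returns null if no entry has that name.
    public static byte[] readEntry(byte[] zip, String name) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] chunk = new byte[4096];
                    int n;
                    while ((n = zis.read(chunk)) != -1) {
                        out.write(chunk, 0, n);
                    }
                    return out.toByteArray();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Three valid bytes followed by two bytes of slack in the buffer.
        byte[] padded = {1, 2, 3, 0, 0};
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(sink)) {
            writeEntry(zos, "images/a.jpg", padded, 3);
        }
        byte[] back = readEntry(sink.toByteArray(), "images/a.jpg");
        System.out.println(Arrays.toString(back)); // prints [1, 2, 3]
    }
}
```

Inside ZipRecordWriter.write, the equivalent would presumably be the same three calls on zos, using key.toString() as the entry name and value.getBytes() with value.getLength() as the data.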

Hadoop approach to outputting millions of small binary/image files
