Question

I'm generating some Parquet (v1.6.0) output from a Pig (v0.15.0) script. The script takes several input sources and joins them with some nesting. It runs without error, but during the STORE operation I get:

2016-04-19 17:24:36,299 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=FAILED, progress=TotalTasks: 249 Succeeded: 220 Running: 0 Failed: 1 Killed: 28 FailedTaskAttempts: 43, diagnostics=Vertex failed, vertexName=scope-1446, vertexId=vertex_1460657535752_15030_1_18, diagnostics=[Task failed, taskId=task_1460657535752_15030_1_18_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:parquet.hadoop.MemoryManager$1: New Memory allocation 134217728 exceeds minimum allocation size 1048576 with largest schema having 132 columns
    at parquet.hadoop.MemoryManager.updateAllocation(MemoryManager.java:125)
    at parquet.hadoop.MemoryManager.addWriter(MemoryManager.java:82)
    at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:104)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:309)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:81)
    at org.apache.tez.mapreduce.output.MROutput.initialize(MROutput.java:398)
    ...

The exception above was thrown when I ran the script with -x tez, but I get the same exception with mapreduce. I tried increasing parallelism with SET default_parallel, and I also added an ORDER BY operation (unnecessary for my real objective) just before my STORE operations, so that Pig has a chance to ship data to different reducers and minimize the memory needed on any one reducer. Finally, I tried raising the available memory with SET mapred.child.java.opts. None of this helped.

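Is there something I'm missing? Are there known strategies for keeping one reducer from carrying too much of the load and failing during the write? I've had similar problems writing Avro output, which appeared to be caused by insufficient memory for the compression step.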

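EDIT: According to this source file, the problem seems to boil down to memAllocation/nCols < minMemAllocation. However, the memory allocation appears to be unaffected by the mapred.child.java.opts setting I tried.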
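To make that condition concrete, here is a minimal Python sketch that plugs in the numbers from the error message. The function name and the use of integer division are assumptions made for illustration; this is not the actual MemoryManager code.

```python
def allocation_too_small(mem_allocation: int, n_cols: int, min_mem_allocation: int) -> bool:
    """Return True when the per-column share of the allocation falls below the minimum.

    Simplified stand-in for the check described in the question
    (memAllocation / nCols < minMemAllocation); integer division is an assumption.
    """
    return mem_allocation // n_cols < min_mem_allocation


# Values taken from the log: 134217728-byte allocation, 132 columns, 1048576-byte minimum.
fails_with_132_columns = allocation_too_small(134217728, 132, 1048576)

# With 128 columns the per-column share is exactly 1048576 bytes, so the check passes.
fails_with_128_columns = allocation_too_small(134217728, 128, 1048576)

print(fails_with_132_columns, fails_with_128_columns)
```

Under these assumptions the check trips at 132 columns but not at 128, which matches the failure reported in the log.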

Answer 1

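I finally solved this with the parquet.block.size parameter. The default value (see source) is large enough to write a file 128 columns wide, but no wider. The solution in Pig is to use SET parquet.block.size x;, where x >= y * 1024^2 and y is the number of columns in your output.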
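As a rough illustration of that rule, the Python sketch below computes the smallest x for a given column count and formats the matching Pig statement. The helper names are hypothetical, and the 1024^2 bytes per column figure is simply the minimum allocation size reported in the error message.

```python
BYTES_PER_COLUMN = 1024 ** 2  # minimum allocation size from the error message (1048576)


def min_block_size(n_columns: int) -> int:
    """Smallest x satisfying x >= y * 1024^2 for y output columns."""
    return n_columns * BYTES_PER_COLUMN


def pig_set_statement(n_columns: int) -> str:
    """Format the Pig SET line for the computed block size (hypothetical helper)."""
    return f"SET parquet.block.size {min_block_size(n_columns)};"


# The failing job in the question had 132 columns.
print(pig_set_statement(132))
```

Any larger value also satisfies the inequality, so rounding up to a convenient size should work as well.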

How to avoid the Parquet MemoryManager exception
