More row groups than expected in a Parquet file

Time: 2017-10-27 22:45:46

Tags: hadoop mapreduce parquet

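I am currently writing Parquet files with MapReduce. I configured the row group size to 256M, and the HDFS block size is also 256M. Each output file is about 1G in size.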
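A minimal sketch of the arithmetic behind that expectation, assuming every row group takes up exactly its configured size on disk. The helper name `expected_row_groups` is made up for illustration and is not part of any Parquet tool.

```python
import math

MIB = 1024 * 1024

def expected_row_groups(file_size_bytes, row_group_size_bytes):
    """Naive estimate: how many full-size row groups fit in one file."""
    return math.ceil(file_size_bytes / row_group_size_bytes)

# Settings from this post: ~1G output files, 256M row groups.
print(expected_row_groups(1024 * MIB, 256 * MIB))
```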

So I would expect 4 row groups in each generated file. But when I run:

parquet-tools meta path/to/my/file | grep "row group"

it reports 63 row groups with varying sizes and row counts:

row group 1:                      RC:69816 TS:244168913
row group 2:                      RC:35111 TS:117407826
row group 3:                      RC:18488 TS:60107388
row group 4:                      RC:10357 TS:33260415
row group 5:                      RC:7905 TS:24956045
row group 6:                      RC:4754 TS:15149122
row group 7:                      RC:3862 TS:12476651
row group 8:                      RC:2738 TS:9001631
row group 9:                      RC:2104 TS:7120040
row group 10:                     RC:1910 TS:6398391
row group 11:                     RC:1508 TS:5219072
row group 12:                     RC:1386 TS:4676154
row group 13:                     RC:1124 TS:3950635
row group 14:                     RC:999 TS:3518545
row group 15:                     RC:865 TS:3121657
row group 16:                     RC:774 TS:2801614
row group 17:                     RC:678 TS:2490904
row group 18:                     RC:511 TS:1996167
row group 19:                     RC:69808 TS:243894989
row group 20:                     RC:30176 TS:99585195
row group 21:                     RC:20678 TS:67779524
row group 22:                     RC:10743 TS:34547874
row group 23:                     RC:8258 TS:26080110
row group 24:                     RC:5227 TS:16456577
row group 25:                     RC:4136 TS:13321721
row group 26:                     RC:3207 TS:10272043
row group 27:                     RC:2437 TS:8107932
row group 28:                     RC:1945 TS:6563867
row group 29:                     RC:1561 TS:5320028
row group 30:                     RC:1389 TS:4809485
row group 31:                     RC:1206 TS:4251584
row group 32:                     RC:996 TS:3581746
row group 33:                     RC:895 TS:3203224
row group 34:                     RC:757 TS:2869939
row group 35:                     RC:653 TS:2550716
row group 36:                     RC:531 TS:2008746
row group 37:                     RC:69706 TS:244420245
row group 38:                     RC:32703 TS:109391929
row group 39:                     RC:18640 TS:60918458
row group 40:                     RC:10737 TS:34272225
row group 41:                     RC:7812 TS:24814707
row group 42:                     RC:5176 TS:16206655
row group 43:                     RC:4123 TS:13224377
row group 44:                     RC:3391 TS:10946649
row group 45:                     RC:2138 TS:7248145
row group 46:                     RC:1960 TS:6566944
row group 47:                     RC:1538 TS:5294523
row group 48:                     RC:1355 TS:4744634
row group 49:                     RC:1225 TS:4194625
row group 50:                     RC:1026 TS:3587484
row group 51:                     RC:877 TS:3134267
row group 52:                     RC:785 TS:2846718
row group 53:                     RC:675 TS:2546836
row group 54:                     RC:538 TS:2016450
row group 55:                     RC:69762 TS:244915809
row group 56:                     RC:32390 TS:108310300
row group 57:                     RC:18095 TS:58754777
row group 58:                     RC:10759 TS:34405301
row group 59:                     RC:8195 TS:26029310
row group 60:                     RC:5286 TS:16597963
row group 61:                     RC:4231 TS:13415076
row group 62:                     RC:3538 TS:11465640
row group 63:                     RC:135 TS:688850

The row groups follow a recurring pattern. Does anyone know why Parquet does not honor my configured row group size (256M)?
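To make the recurring pattern easier to see, here is a small sketch that parses `row group N: RC:... TS:...` lines and reports where each shrinking run starts. The function names `parse_row_groups` and `cycle_starts` are hypothetical helpers, and the regular expression assumes the exact line layout shown in the listing above.

```python
import re

# Matches lines such as "row group 19:   RC:69808 TS:243894989".
ROW_GROUP_RE = re.compile(r"row group (\d+):\s+RC:(\d+)\s+TS:(\d+)")

def parse_row_groups(text):
    """Return (index, row_count, total_size) tuples found in the text."""
    return [tuple(int(g) for g in m.groups()) for m in ROW_GROUP_RE.finditer(text)]

def cycle_starts(groups):
    """Indices of row groups that begin a new run: the first group,
    or any group whose TS is larger than the previous group's TS."""
    starts = []
    prev_size = None
    for index, _rows, size in groups:
        if prev_size is None or size > prev_size:
            starts.append(index)
        prev_size = size
    return starts

# A few lines copied from the listing above.
sample = """\
row group 17:                     RC:678 TS:2490904
row group 18:                     RC:511 TS:1996167
row group 19:                     RC:69808 TS:243894989
row group 20:                     RC:30176 TS:99585195
"""
groups = parse_row_groups(sample)
print(cycle_starts(groups))
```

Fed the full 63-line listing instead of the sample, the same check flags row groups 1, 19, 37, and 55, each with a TS of about 244 MB.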

1 answer:

Answer 0: (score: 0)

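This is a known open issue when writing Parquet files with Parquet-MR. The row group sizing algorithm does not take compression into account, so it ends up creating more row groups than expected. That would be consistent with the listing in the question, where the sizes shrink steadily and then jump back to about 244 MB at row groups 19, 37, and 55.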

You can find more information about it here: https://issues.apache.org/jira/browse/PARQUET-1337
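As a rough illustration of how ignoring compression could produce a shrinking run of row groups, here is a toy model. It is not Parquet-MR's actual code. It assumes the writer targets whatever space is left in the current block, flushes once the uncompressed buffered data reaches that target, and then writes a smaller compressed row group. The names `simulate_block`, `compression_ratio`, and `min_group_size` are all made up for this sketch.

```python
def simulate_block(block_size, compression_ratio, min_group_size):
    """Toy model: uncompressed sizes of the row groups written into one block.

    Each round the writer aims at the space still free in the block, but
    the flushed row group only occupies compression_ratio of that target
    on disk, so free space is left over and another, smaller row group
    gets started in the same block.
    """
    remaining = block_size
    groups = []
    while remaining >= min_group_size:
        uncompressed = remaining  # flush when buffered data reaches the target
        on_disk = int(uncompressed * compression_ratio)
        groups.append(uncompressed)
        remaining -= on_disk
    return groups

# Sizes in MB: a 256 MB block, 2:1 compression, stop below 8 MB.
print(simulate_block(256, 0.5, 8))
```

With these made-up numbers the sizes halve each round until the remaining space is too small. The real listing shrinks less regularly, so treat this only as an intuition for the mechanism described in the linked issue.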