Why am I seeing Parquet write errors after switching to EMRFS consistent view?

Asked: 2019-02-26 14:38:12

Tags: apache-spark amazon-s3 pyspark amazon-emr

We run a large ETL process on an EMR cluster that reads and writes a large number of Parquet files to S3 buckets.

The code looks like this:

# read the three input Parquet datasets and register them for SQL
a = spark.read.parquet(path1)
a.registerTempTable('a')
b = spark.read.parquet(path2)
b.registerTempTable('b')
c = spark.read.parquet(path3)
c.registerTempTable('c')

sql = '''
select
    a.col1,
    a.col2,
    b.col1,
    b.col2,
    c.col1,
    c.col2,
    a.dt
from
    a
join
    b
on
    a.dt = b.dt
join
    c
on
    a.dt = c.dt
'''

df_out = spark.sql(sql)

# shuffle by dt so each partition value is written by a single task
df_out.repartition('dt').write.parquet(path_out, partitionBy='dt', mode='overwrite')

We recently had to switch to transient clusters, and therefore had to start using consistent view. Our emrfs-site settings are below:

{
  "fs.s3.enableServerSideEncryption": "true",
  "fs.s3.consistent": "false",
  "fs.s3.consistent.retryPeriodSeconds": "10",
  "fs.s3.serverSideEncryption.kms.keyId": "xxxxxx",
  "fs.s3.consistent.retryCount": "5",
  "fs.s3.consistent.metadata.tableName": "xxxxx",
  "fs.s3.consistent.throwExceptionOnInconsistency": "true"
}
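
(For completeness, these properties are applied at cluster creation through the emrfs-site configuration classification; a rough sketch of the wrapper JSON, with most properties elided:)

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.consistent": "false",
      "fs.s3.consistent.metadata.tableName": "xxxxx"
    }
  }
]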

Running the same code with the same Spark configuration, which works on the permanent cluster, fails on a transient cluster with consistent view enabled:

...
19/02/25 23:01:23 DEBUG S3NativeFileSystem: getFileStatus could not find key 'xxxxxREDACTEDxxxx'
19/02/25 23:01:23 DEBUG S3NativeFileSystem: Delete called for 'xxxxxREDACTEDxxxx' but file does not exist, so returning false
19/02/25 23:01:23 DEBUG DFSClient: DFSClient writeChunk allocating new packet seqno=465, src=/var/log/spark/apps/application_1551126537652_0003.inprogress, packetSize=65016, chunksPerPacket=126, bytesCurBlock=25074688
19/02/25 23:01:23 DEBUG DFSClient: DFSClient flush(): bytesCurBlock=25081892 lastFlushOffset=25075161 createNewBlock=false
19/02/25 23:01:23 DEBUG DFSClient: Queued packet 465
19/02/25 23:01:23 DEBUG DFSClient: Waiting for ack for: 465
19/02/25 23:01:23 DEBUG DFSClient: DataStreamer block BP-75703405-10.13.32.237-1551126523840:blk_1073741876_1052 sending packet packet seqno: 465 offsetInBlock: 25074688 lastPacketInBlock: false lastByteOffsetInBlock: 25081892
19/02/25 23:01:23 DEBUG DFSClient: DFSClient seqno: 465 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
Traceback (most recent call last):
  File "xxxxxREDACTEDxxxx", line 112, in <module>
    main()
  File "xxxxxREDACTEDxxxx", line xxxxxREDACTEDxxxx, in main
    xxxxxREDACTEDxxxx
  File "xxxxxREDACTEDxxxx", line 70, in main
    partitionBy='dt', mode='overwrite')
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 691, in parquet
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o232.parquet.
: org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)

I suspect the error is caused by the EMRFS settings, but I have not been able to find a combination of EMRFS settings that works. The only thing that stops the error when running the code above is doubling the number of nodes. The error also goes away if we reduce the amount of data.

Changing the output committer and Spark speculation settings did not help either.
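
For reference, the kind of knobs in play look roughly like the sketch below (illustrative values only; the EMRFS S3-optimized committer flag assumes EMR 5.19+):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # turn speculative execution off so duplicate task attempts don't race on S3
    .config('spark.speculation', 'false')
    # FileOutputCommitter algorithm v2 skips the serial rename pass on the driver
    .config('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')
    # EMRFS S3-optimized committer toggle (available on EMR 5.19+)
    .config('spark.sql.parquet.fs.optimized.committer.optimization-enabled', 'true')
    .getOrCreate()
)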

Thanks very much in advance. Any ideas/suggestions would be greatly appreciated.

1 Answer:

Answer 0 (score: 0)

"fs.s3.consistent": "false" should be "true" for the EMRFS consistent view to work.
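
With that change, the emrfs-site block from the question would look like this (the same properties, with only the consistency flag flipped):

{
  "fs.s3.enableServerSideEncryption": "true",
  "fs.s3.consistent": "true",
  "fs.s3.consistent.retryPeriodSeconds": "10",
  "fs.s3.serverSideEncryption.kms.keyId": "xxxxxx",
  "fs.s3.consistent.retryCount": "5",
  "fs.s3.consistent.metadata.tableName": "xxxxx",
  "fs.s3.consistent.throwExceptionOnInconsistency": "true"
}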