Error on AWS EMR when exporting to S3: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found

Date: 2018-08-02 06:28:46

Tags: pyspark amazon-emr

I'm trying to export data from the EMR master node to an S3 bucket, but it fails. When I run the following lines from my pyspark code:

(DF1
    .coalesce(1)
    .write
    .format("csv")
    .option("header", "true")
    .save("s3://fittech-bucket/emr/outputs/test_data"))

I get the following error:

An error occurred while calling o78.save.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
        at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:452)
        at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:548)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:278)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)

1 Answer:

Answer 0: (score: 0)

Try writing directly to the local HDFS filesystem, then copying the local files to S3 with aws s3 cp. Alternatively, you can enable EMRFS and use its sync feature so that local changes are pushed to S3 automatically; for the EMRFS reference, see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-cli-reference.html.

This may be a workaround, but it should resolve your main problem, and using EMRFS brings a number of benefits of its own. If you want to run the EMRFS sync command from inside Python (I'm not sure there is a way to do this from boto3), you can do so by executing a bash command from Python, e.g.: Running Bash commands in Python
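The workaround above can be sketched in Python by shelling out to hadoop fs and the AWS CLI. This is a minimal sketch, not the answerer's exact commands; the paths and bucket name are assumptions taken from the question.

```python
import subprocess

def build_copy_commands(hdfs_path, local_dir, s3_path):
    """Build the two shell commands that move data HDFS -> local disk -> S3."""
    return [
        # Pull the Spark output from HDFS onto the master node's local disk
        ["hadoop", "fs", "-get", hdfs_path, local_dir],
        # Push the whole directory to the S3 bucket with the AWS CLI
        ["aws", "s3", "cp", local_dir, s3_path, "--recursive"],
    ]

def copy_to_s3(hdfs_path, local_dir, s3_path):
    """Run both commands, raising CalledProcessError on a non-zero exit code."""
    for cmd in build_copy_commands(hdfs_path, local_dir, s3_path):
        subprocess.run(cmd, check=True)

# Usage sketch: first save to HDFS instead of S3, then copy:
#   DF1.coalesce(1).write.format("csv").option("header", "true") \
#       .save("hdfs:///emr/outputs/test_data")
#   copy_to_s3("hdfs:///emr/outputs/test_data", "/tmp/test_data",
#              "s3://fittech-bucket/emr/outputs/test_data")
```

The same subprocess pattern works for running emrfs sync from Python, as the answer suggests.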

If you just want to push files to S3 using boto3, uploading files to S3 via boto3 is documented here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-creating-buckets.html
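A minimal boto3 sketch of that upload, assuming the bucket name from the question and a hypothetical key-naming helper; boto3 is imported inside the function (it is preinstalled on EMR nodes).

```python
import os

def s3_key_for(local_path, prefix="emr/outputs"):
    """Derive an S3 object key from a local file name (naming scheme is an assumption)."""
    return f"{prefix}/{os.path.basename(local_path)}"

def upload_to_s3(local_path, bucket="fittech-bucket"):
    """Upload one local file to S3 with boto3, as in the linked guide."""
    import boto3  # lazy import; available on EMR out of the box
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, s3_key_for(local_path))
```

For example, upload_to_s3("/tmp/test_data/part-00000.csv") would create the object emr/outputs/part-00000.csv in the bucket.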

You can also use s3-dist-cp or hadoop fs to copy to and from S3, as described here: How does EMR handle an s3 bucket for input and output?
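For completeness, an s3-dist-cp invocation built from Python; the source and destination paths are illustrative, not from the original answer.

```python
import subprocess

def s3_dist_cp_cmd(src, dest):
    """Build an s3-dist-cp command line (--src/--dest flags per the EMR docs)."""
    return ["s3-dist-cp", f"--src={src}", f"--dest={dest}"]

def run_s3_dist_cp(src, dest):
    """Execute the copy on an EMR node; raises CalledProcessError on failure."""
    subprocess.run(s3_dist_cp_cmd(src, dest), check=True)

# Usage sketch (on an EMR node):
#   run_s3_dist_cp("hdfs:///emr/outputs/test_data",
#                  "s3://fittech-bucket/emr/outputs/test_data")
```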
