I am running Spark 2.3 on EMR and trying to write data to HDFS using Scala, as follows:
dataframe.write
  .partitionBy("column1")
  .bucketBy(1, "column2")
  .sortBy("column2")
  .mode("overwrite")
  .format("parquet")
  .option("path", "hdfs:///destination/")
  .saveAsTable("result")
Once the data has been written and the tasks complete, I get a timeout error. After the error occurs, I can see that the data in HDFS was written completely.
Why does this error happen? Does it mean anything?
The master node seems to be trying to communicate with another IP (one that does not match any node's IP), even though the data is already in HDFS.
Note that this only happens with saveAsTable; it does not occur when using .save("hdfs:///location/") or .save("s3://bucket/folder/"). I need saveAsTable in order to use bucketing and sorting.
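For comparison, a minimal sketch of the plain .save() variant that does not exhibit the timeout (paths are the illustrative ones from above). Unlike saveAsTable, .save() writes straight to the filesystem and never registers the table with the Hive/Glue metastore, which is why it sidesteps the error; however, in Spark 2.3 bucketBy/sortBy are only supported with saveAsTable, so this variant loses the bucketing:

```scala
// Writes Parquet directly to HDFS without touching the metastore.
// bucketBy/sortBy cannot be used here: Spark 2.3 throws an
// AnalysisException if they are combined with .save().
dataframe.write
  .partitionBy("column1")
  .mode("overwrite")
  .format("parquet")
  .save("hdfs:///destination/")
```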
A summary of the error log is below:
18/07/23 16:33:31 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`result` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
18/07/23 16:35:32 ERROR log: Got exception: org.apache.hadoop.net.ConnectTimeoutException Call From ip-master_node_ip/master.node.ip to other_ip.ec2.internal:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=other_ip.ec2.internal/other_ip:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-master_node_ip/master.node.ip to other_ip.ec2.internal:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=other_ip.ec2.internal/other_ip:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=other_ip.ec2.internal/other_ip:8020]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)
at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 110 more
18/07/23 16:35:32 ERROR log: Converting exception to MetaException
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-master_node_ip/master.node.ip to other_ip.ec2.internal:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=other_ip.ec2.internal/other_ip:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
... 49 elided
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=other_ip.ec2.internal/other_ip:8020]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)
For reference, I tried the workaround posted here, but I still get the error when specifying the master node IP in the path hdfs:///master_node_ip:8020/location/.
Answer 0 (score: 0)
If your EMR cluster is using the Glue metastore by default and the database does not exist, you will see this timeout. You can either remove the configuration, as suggested, or create the database:
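A hedged sketch of the "create the database" option, assuming an active SparkSession named spark; the database and table names are the illustrative ones from the question. Creating the database up front means the metastore lookup that saveAsTable performs can succeed:

```scala
// Make sure the target database exists in the (Glue or Hive)
// metastore before calling saveAsTable, and qualify the table
// name with the database explicitly.
spark.sql("CREATE DATABASE IF NOT EXISTS default")

dataframe.write
  .partitionBy("column1")
  .bucketBy(1, "column2")
  .sortBy("column2")
  .mode("overwrite")
  .format("parquet")
  .option("path", "hdfs:///destination/")
  .saveAsTable("default.result")
```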
Classification: hive-site
Property: hive.metastore.client.factory.class
Value: com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
Source: Cluster configuration
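The property listing above corresponds to an entry like the following in the cluster's EMR configuration JSON (a sketch of the standard EMR configuration format); removing this entry makes Hive fall back to the cluster-local metastore instead of Glue:

```json
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```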