org.apache.spark.SparkException:由于阶段失败导致作业中止:阶段任务失败,阶段任务丢失:ExecutorLostFailure(执行器4丢失)

时间:2019-03-31 14:52:28

标签: apache-spark cluster-computing

我通过此命令使用JDK 1.7和Hadoop 2.6.2构建MonoSpark(基于Spark 1.3.1)(我编辑了pom.xml,以便该命令可以工作)

./make-distribution.sh --tgz  -Phadoop-2.6 -Dhadoop.version=2.6.2

然后,我得到一个名为“ spark-1.3.1-SNAPSHOT-bin-2.6.2.tgz”的tgz文件。   我将tgz文件放在我的hadoop群集中,该群集具有一个主服务器和4个从属服务器。 然后,我使用命令启动火花。

$SPARK_HOME/sbin/start-all.sh

火花很好,因为有4名工人和1名船长。但是,当我使用spark-submit来运行示例时:

./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master spark://master:7077 lib/spark-examples-1.3.1-*-hadoop2.6.2.jar  input/README.md

我在我的驱动程序上收到此错误,如下所示

......other useless logs.....
19/03/31 22:24:41 ERROR cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 2
19/03/31 22:24:46 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@slave3:55311] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:24:50 ERROR scheduler.TaskSchedulerImpl: Lost executor 3 on slave1: remote Akka client disassociated
19/03/31 22:24:54 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
.......other useless logs......
Exception in thread "main" 19/03/31 22:24:54 ERROR cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 4
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, slave4): ExecutorLostFailure (executor 4 lost)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1325)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1314)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1313)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1313)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:714)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:714)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:714)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1526)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1487)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

工作节点错误日志如下:

19/03/31 22:25:11 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/2 for JavaWordCount
19/03/31 22:25:19 INFO worker.Worker: Executor app-20190331222434-0000/2 finished with state EXITED message Command exited with code 50 exitStatus 50
19/03/31 22:25:19 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@slave4:37919] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:25:19 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.2.27%3A35254-2#299045174] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
19/03/31 22:25:19 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/4 for JavaWordCount
19/03/31 22:25:19 INFO worker.ExecutorRunner: Launch command: "/usr/local/java/jdk1.8.0_101/bin/java" "-cp" "/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop" "-Dspark.driver.port=42211" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver@master:42211/user/CoarseGrainedScheduler" "--executor-id" "4" "--hostname" "slave4" "--cores" "4" "--app-id" "app-20190331222434-0000" "--worker-url" "akka.tcp://sparkWorker@slave4:55970/user/Worker"
19/03/31 22:25:32 INFO worker.Worker: Executor app-20190331222434-0000/4 finished with state EXITED message Command exited with code 50 exitStatus 50
19/03/31 22:25:32 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@slave4:60559] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:25:32 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.2.27%3A35260-3#479615849] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
19/03/31 22:25:32 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/7 for JavaWordCount
19/03/31 22:25:32 INFO worker.ExecutorRunner: Launch command: "/usr/local/java/jdk1.8.0_101/bin/java" "-cp" "/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop" "-Dspark.driver.port=42211" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver@master:42211/user/CoarseGrainedScheduler" "--executor-id" "7" "--hostname" "slave4" "--cores" "4" "--app-id" "app-20190331222434-0000" "--worker-url" "akka.tcp://sparkWorker@slave4:55970/user/Worker"
19/03/31 22:25:32 INFO worker.Worker: Asked to kill executor app-20190331222434-0000/7
19/03/31 22:25:32 INFO worker.ExecutorRunner: Runner thread for executor app-20190331222434-0000/7 interrupted
19/03/31 22:25:32 INFO worker.ExecutorRunner: Killing process!
19/03/31 22:25:32 INFO worker.Worker: Executor app-20190331222434-0000/7 finished with state KILLED exitStatus 143
19/03/31 22:25:32 INFO worker.Worker: Cleaning up local directories for application app-20190331222434-0000

关于hadoop版本是否存在任何错误?也许我使用了错误的hadoop版本或jdk版本来构建Spark。   希望有人能给我一些建议,谢谢。

1 个答案:

答案 0 :(得分:0)

我在执行程序中发现一些错误:

java.lang.UnsupportedOperationException: Datanode-side support for getVolumeBlockLocations() must also be enabled in the client configuration.

我在hadoop-site.xml中将dfs.datanode.hdfs-blocks-metadata.enabled设置为true,然后重新启动hadoop集群。最后,它对我有用。

执行器的错误日志在目录中:工作

cd $SPARK_HOME/work/appxxxx/xx(xx is a number)
相关问题