FileNotFoundException when deploying a pyspark job on a YARN cluster

Date: 2020-02-26 15:10:56

Tags: apache-spark pyspark

I am trying to submit the following test.py Spark application on a YARN cluster with this command:

PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn --deploy-mode cluster --archives venv#venv test.py

Note: I am not using local mode. I am trying to use the Python 3.7 site-packages from the virtualenv that was used to build the code in PyCharm, because the custom application packages provided by the virtualenv are not available as a cluster-wide service.

Here is what the Python project structure and the venv directory contents look like:

-rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz
-rw-r--r-- 1 schakrabarti nobody      1313 Feb 26 13:07 test.py
drwxr-xr-x 6 schakrabarti nobody      4096 Feb 26 13:07 venv
drwxr-xr-x 3 schakrabarti nobody      4096 Feb 26 13:07 venv/bin
drwxr-xr-x 3 schakrabarti nobody      4096 Feb 26 13:07 venv/share
-rw-r--r-- 1 schakrabarti nobody        75 Feb 26 13:07 venv/pyvenv.cfg
drwxr-xr-x 2 schakrabarti nobody      4096 Feb 26 13:07 venv/include
drwxr-xr-x 3 schakrabarti nobody      4096 Feb 26 13:07 venv/lib
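As a side note on the listing above, an archive like venv.tar.gz is typically what spark-submit's --archives option consumes (it takes archive files, with an optional #alias). A minimal sketch of producing such an archive with relative paths, so it unpacks cleanly under the alias on the executors (the mkdir is only a stand-in so the sketch runs anywhere):

```shell
# Stand-in for a real virtualenv directory; in practice ./venv already exists.
mkdir -p venv/bin

# Archive the virtualenv with relative paths (venv/..., not /abs/path/venv/...),
# producing the venv.tar.gz shown in the listing.
tar -czf venv.tar.gz venv
```

The archive would then be referenced as, e.g., `--archives venv.tar.gz#venv`.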

The job fails with the same "File does not exist" error for pyspark.zip (shown below):

java.io.FileNotFoundException: File does not exist: hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip

Please see the comment I added on SPARK-10795: https://issues.apache.org/jira/browse/SPARK-10795

1 answer:

Answer 0: (score: 0)

Apologies if I have misunderstood the question, but according to your submit command

you are using a YARN cluster:

PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn --deploy-mode cluster --archives venv#venv test.py

but in your test.py you are trying to connect to a Spark standalone cluster:

#test.py
import json
from pyspark.sql import SparkSession

if __name__ == "__main__":
  spark = SparkSession.builder \
   .appName("Test_App") \
   .master("spark://gwrd352n36.red.ygrid.yahoo.com:41767") \
   .config("spark.ui.port", "4057") \
   .config("spark.executor.memory", "4g") \
   .getOrCreate()

  print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4))

  spark.stop()

So that is probably the problem.
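A sketch of the fix this answer implies: drop the hardcoded .master("spark://...") call (and the standalone-specific UI port setting) so that the --master yarn and --deploy-mode cluster flags from spark-submit take effect. The remaining settings are illustrative, not prescriptive:

```python
# test.py -- sketch: no hardcoded master, so spark-submit's --master yarn
# decides where the application runs.
import json

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (SparkSession.builder
             .appName("Test_App")
             .config("spark.executor.memory", "4g")  # illustrative setting
             .getOrCreate())

    # Print the effective configuration, as in the original test.py.
    print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4))

    spark.stop()
```

Submitted with the same spark-submit command, the master then resolves to yarn rather than the hardcoded standalone URL.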