Why doesn't the Spark package resolver (`--packages`) copy dependencies into $SPARK_HOME/jars?

Asked: 2020-11-09 07:04:25

Tags: apache-spark hadoop spark-submit spark-shell

Can someone explain to me why, even though the automatic package resolver (--packages) pulls in com.amazonaws_aws-java-sdk-bundle as a dependency, I still have to copy that jar into my local $SPARK_HOME/jars by hand?

What I do is launch spark-shell (a spark-submit in client mode) like this:

$SPARK_HOME/bin/spark-shell \
  --master k8s://https://localhost:6443  \
  --deploy-mode client  \
  --conf spark.executor.instances=1  \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark  \
  --conf spark.kubernetes.container.image=spark:spark-docker  \
  --packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
  --conf spark.hadoop.fs.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key=$MINIO_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=$MINIO_SECRET_KEY \
  --conf spark.hadoop.fs.s3a.endpoint=$MINIO_ENDPOINT \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
  --conf spark.hadoop.fs.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.driver.port=4040 \
  --name spark-locally
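
Not strictly part of the question, but here is roughly how I would inspect what the resolver actually hands to the running session, from a pyspark session started with the same flags (I use Python for all the sketches below; sc._jvm is a private PySpark/py4j attribute, so this is for interactive debugging only):

# Sketch: inspect what --packages wired into the session (run inside pyspark).
sc = spark.sparkContext

# If the resolver added the downloaded jars, they should show up here.
print(sc.getConf().get("spark.jars", "<not set>"))
print(sc.getConf().get("spark.jars.packages", "<not set>"))

# What is actually on the driver JVM's classpath right now.
# _jvm is a private attribute (py4j gateway); debugging use only.
print(sc._jvm.java.lang.System.getProperty("java.class.path"))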

My setup is the latest Spark 3.0.1 with Hadoop 3.2 (downloaded from here), running against local Kubernetes via Docker Desktop for Mac.

The --packages org.apache.hadoop:hadoop-aws:3.2.0 above downloads its dependencies successfully, including com.amazonaws_aws-java-sdk-bundle-1.11.375:

Ivy Default Cache set to: /Users/sspaeti/.ivy2/cache
The jars for the packages stored in: /Users/sspaeti/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sspaeti/Documents/spark/spark-3.0.1-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hadoop#hadoop-aws added as a dependency
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-91fd31e1-0b2a-448c-9c69-fd9dc430d41c;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;3.2.0 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.375 in central
    found io.delta#delta-core_2.12;0.7.0 in central
    found org.antlr#antlr4;4.7 in central
    found org.antlr#antlr4-runtime;4.7 in central
    found org.antlr#antlr-runtime;3.5.2 in central
    found org.antlr#ST4;4.0.8 in central
    found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
    found org.glassfish#javax.json;1.0.4 in central
    found com.ibm.icu#icu4j;58.2 in central
:: resolution report :: resolve 376ms :: artifacts dl 22ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.375 from central in [default]
    com.ibm.icu#icu4j;58.2 from central in [default]
    io.delta#delta-core_2.12;0.7.0 from central in [default]
    org.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]
    org.antlr#ST4;4.0.8 from central in [default]
    org.antlr#antlr-runtime;3.5.2 from central in [default]
    org.antlr#antlr4;4.7 from central in [default]
    org.antlr#antlr4-runtime;4.7 from central in [default]
    org.apache.hadoop#hadoop-aws;3.2.0 from central in [default]
    org.glassfish#javax.json;1.0.4 from central in [default]
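
The files do land in the local Ivy cache at the paths the log mentions; a quick (hypothetical) check of what ended up in the jar directory:

# Sketch: list what --packages downloaded into the local Ivy jar directory
# (the path matches the "jars for the packages stored in" line above).
import glob
import os

for jar in sorted(glob.glob(os.path.expanduser("~/.ivy2/jars/*.jar"))):
    print(jar)
# com.amazonaws_aws-java-sdk-bundle-1.11.375.jar is expected to be among them.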

But why do I then still get java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException (full stack trace here)? What I don't understand: since I run with deploy-mode client, I assumed the Maven/Ivy resolver would make all dependencies available to the local Spark driver. Isn't that the case, or where is the missing piece of the puzzle?

I also tried --packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0,com.amazonaws:aws-java-sdk-bundle:1.11.375, with no luck either.
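
To narrow it down, a rough driver-side check I would run from pyspark (again via the private _jvm handle; a negative result is only indicative, because jars added by --packages may sit on a child classloader that this Class.forName call does not consult):

# Sketch: can the driver JVM load the class the error complains about?
from py4j.protocol import Py4JJavaError

def driver_can_load(class_name):
    # Returns True if Class.forName succeeds in the driver JVM.
    try:
        spark.sparkContext._jvm.java.lang.Class.forName(class_name)
        return True
    except Py4JJavaError:
        return False

print(driver_can_load("org.apache.hadoop.fs.s3a.S3AFileSystem"))
print(driver_can_load("com.amazonaws.services.s3.model.MultiObjectDeleteException"))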

My workaround, but I don't know why it is needed

What does work is copying the jar manually (from Maven, or straight out of the downloaded .ivy2 folder), like this:

cp $HOME/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.375.jar $SPARK_HOME/jars

After that, I can read from and write to my local S3 (MinIO) successfully.
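
An alternative I assume should work instead of copying into $SPARK_HOME/jars (untested sketch; the paths are from my Ivy cache and would need adjusting) is to hand Spark the already-downloaded jars explicitly via spark.jars, the programmatic equivalent of --jars:

# Sketch: point the session at the Ivy-cached jars instead of copying them.
import os
from pyspark.sql import SparkSession

ivy_jars = ",".join(os.path.expanduser("~/.ivy2/jars/" + name) for name in [
    "com.amazonaws_aws-java-sdk-bundle-1.11.375.jar",
    "org.apache.hadoop_hadoop-aws-3.2.0.jar",
    "io.delta_delta-core_2.12-0.7.0.jar",
])

spark = (
    SparkSession.builder
    .appName("spark-locally")
    .config("spark.jars", ivy_jars)  # same idea as passing --jars on the CLI
    .getOrCreate()
)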

Working with Jupyter

Another odd thing: I also have a Jupyter notebook running on the same local Kubernetes, and there the plain --packages approach works fine. The difference is that Jupyter uses pyspark, so why does it work with pyspark but not with spark-shell?

If that is the relevant difference, how would I run the same test with pyspark instead of spark-shell?
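
For reference, the kind of test I have in mind would look roughly like this in pyspark (the bucket name is just a placeholder for a bucket on the local MinIO, and it assumes the session was started with the same --packages and fs.s3a settings as above):

# Sketch: minimal read/write round trip against MinIO via s3a + Delta.
data = spark.range(10).withColumnRenamed("id", "value")

# The write forces the S3A filesystem and AWS SDK classes to load ...
data.write.format("delta").mode("overwrite").save("s3a://my-bucket/delta-smoke-test")

# ... and the read verifies the round trip.
spark.read.format("delta").load("s3a://my-bucket/delta-smoke-test").show()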

Many thanks for any explanation; I have already lost a lot of time on this.

0 answers