I'm using the project structure below and keep all reusable classes in sparkCommonLib.py inside the CommonPackage module.
- README.rst
- LICENSE
- setup.py
- requirements.txt
- CommonPackage/__init__.py
- CommonPackage/sparkCommonLib.py
- CommonPackage/config.xml
- pyspark/__init__.py
- pyspark/SparkTableInsert.py
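For reference, a minimal sketch of what the setup.py is assumed to contain; the name and version are inferred from the SparkTableInsert-0.0.0.zip artifact that appears later, and the rest is hypothetical:

```python
# Minimal sketch of setup.py (metadata inferred/assumed, not verbatim).
from setuptools import setup

setup(
    name="SparkTableInsert",
    version="0.0.0",
    packages=["CommonPackage", "pyspark"],          # package names from the tree above
    package_data={"CommonPackage": ["config.xml"]}, # ship the XML config with the package
)
```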
Below is the common class in sparkCommonLib.py that sets up a Spark session for the different Spark applications in the pyspark module.
```python
import os
from pyspark.sql import SparkSession


class set_spark_session():
    def __init__(self, appname, master=None):
        # set_spark() lives elsewhere in sparkCommonLib and exposes the
        # settings read from config.xml
        sparkenv = set_spark()
        if master is None:
            master = sparkenv.MASTER
        self.batchsize = sparkenv.BATCH_SIZE
        self.maxpartition = sparkenv.MAX_PART
        # Comma-separated list of the JDBC driver jars
        drivers_path = os.path.normpath(sparkenv.DRIVER_PATH)
        jars = os.path.join(drivers_path, "mssql-jdbc-8.4.1.jre8.jar") \
            + "," + os.path.join(drivers_path, "jconn4.jar")
        self.spark = SparkSession \
            .builder \
            .config("spark.driver.extraClassPath", drivers_path) \
            .config("spark.jars", jars) \
            .appName(appname) \
            .master(master) \
            .getOrCreate()
```
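set_spark() is not shown in the post; for context, here is a minimal sketch of the kind of helper it is assumed to be, reading the values out of CommonPackage/config.xml (the element names below are hypothetical):

```python
import os
import xml.etree.ElementTree as ET


class set_spark():
    """Sketch only: load Spark settings from the config.xml that sits next
    to this module. The element names (master, batchsize, ...) are
    hypothetical, not taken from the post."""
    def __init__(self):
        config = os.path.join(os.path.dirname(__file__), "config.xml")
        root = ET.parse(config).getroot()
        self.MASTER = root.findtext("master")
        self.BATCH_SIZE = int(root.findtext("batchsize"))
        self.MAX_PART = int(root.findtext("maxpartition"))
        self.DRIVER_PATH = root.findtext("driverpath")
```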
Below is the application code in pyspark/SparkTableInsert.py in the pyspark module.
```python
import numpy as np, time, sys, os, pandas as pd
from CommonPackage import sparkCommonLib as sparkCommon

if __name__ == '__main__':
    spark = sparkCommon.set_spark_session(appname="SparkTableInsert").spark
    spark.sparkContext.setLogLevel("ERROR")
    log = sparkCommon.common.PrintLogInfo()
    ## Some Spark processing steps
    spark.stop()
```
The application runs fine when I install the package with pip install, but when I run the same application as a spark-submit job I get the following error.
```
spark-submit --jars ".\CommonPackage\jars\jconn4.jar,.\CommonPackage\jars\mssql-jdbc-8.4.1.jre8.jar" --py-files "dependencies.zip" ".\pyspark\SparkTableInsert.py" <Arg1> <Arg2>

Traceback (most recent call last):
  File "./pyspark/SparkTableInsert.py", line 2, in <module>
    from CommonPackage import sparkCommonLib as sparkCommon
ModuleNotFoundError: No module named 'CommonPackage'
```
I tried adding the full application egg or zip file created with `python setup.py sdist`, but the result was the same, with and without --archives:
```
spark-submit --jars ".\CommonPackage\jars\jconn4.jar,.\CommonPackage\jars\mssql-jdbc-8.4.1.jre8.jar" --py-files "SparkTableInsert-0.0.0.zip" ".\pyspark\SparkTableInsert.py" <arg1> <arg2>

Traceback (most recent call last):
  File "./pyspark/SparkTableInsert.py", line 2, in <module>
    from CommonPackage import sparkCommonLib as sparkCommon
ModuleNotFoundError: No module named 'CommonPackage'

spark-submit --jars ".\CommonPackage\jars\jconn4.jar,.\CommonPackage\jars\mssql-jdbc-8.4.1.jre8.jar" --py-files "dependencies.zip" --archives "SparkTableInsert-0.0.0.zip" ".\pyspark\SparkTableInsert.py" <arg1> <arg2>

Traceback (most recent call last):
  File "./pyspark/SparkTableInsert.py", line 2, in <module>
    from CommonPackage import sparkCommonLib as sparkCommon
ModuleNotFoundError: No module named 'CommonPackage'

spark-submit --jars ".\CommonPackage\jars\jconn4.jar,.\CommonPackage\jars\mssql-jdbc-8.4.1.jre8.jar" --archives "SparkTableInsert-0.0.0.zip" --py-files "dependencies.zip" ".\pyspark\SparkTableInsert.py" <arg1> <arg2>

Traceback (most recent call last):
  File "./pyspark/SparkTableInsert.py", line 2, in <module>
    from CommonPackage import sparkCommonLib as sparkCommon
ModuleNotFoundError: No module named 'CommonPackage'
```
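dependencies.zip is assumed here to have CommonPackage at the archive root, which is the layout --py-files expects; a sketch of building such a zip from the project root (an assumption about how it was produced, not taken from the post):

```python
# Sketch: build dependencies.zip with CommonPackage at the archive root,
# the layout --py-files expects. Assumes it is run from the project root.
import os
import zipfile

with zipfile.ZipFile("dependencies.zip", "w") as zf:
    for root, _dirs, files in os.walk("CommonPackage"):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, arcname=path)  # keep the CommonPackage/ prefix
```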
I also tried adding the following lines in ./pyspark/SparkTableInsert.py, but got the same error.
```python
spark.sparkContext.addPyFile("SparkTableInsert-0.0.0.zip")
spark.sparkContext.addPyFile("dependencies.zip")
```
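Placed in context, those calls sit roughly like this; presumably the module-level import at line 2 still runs before either addPyFile call executes, which would be consistent with the unchanged traceback (a sketch, not the exact file):

```python
import numpy as np, time, sys, os, pandas as pd
from CommonPackage import sparkCommonLib as sparkCommon  # line 2: fails before anything below runs

if __name__ == '__main__':
    spark = sparkCommon.set_spark_session(appname="SparkTableInsert").spark
    # Too late: the import above has already been attempted
    spark.sparkContext.addPyFile("SparkTableInsert-0.0.0.zip")
    spark.sparkContext.addPyFile("dependencies.zip")
```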
I'm testing spark-submit on my local machine first, before pushing it as a Google Spark job to run on a Google cluster.