Problem running a Spark application on YARN

Time: 2017-11-10 10:33:50

Tags: apache-spark pyspark

I am running a test Spark environment (single node) on AWS. I ran a few ad-hoc queries in the PySpark shell and everything worked as expected; however, when I run the application with spark-submit, I get an error.

Here is the code:

from __future__ import print_function
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import SQLContext as sql

conf = SparkConf().setAppName("myapp")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
if __name__ == "__main__":
    #inp_data = loaded data from db
    df = inp_data.select('Id','DueDate','Principal','delay','unpaid_emi','future_payment')
    filterd_unpaid_emi = df.filter(df.unpaid_emi == 1)
    par = filterd_unpaid_emi.groupBy('Id').sum('Principal').withColumnRenamed('sum(Principal)' , 'par')
    temp_df = df.filter(df.unpaid_emi == 1)
    temp_df_1 = temp_df.filter(temp_df.future_payment == 0)
    temp_df_1.registerTempTable("mytable")
    bucket_df_1 = sql("""select *, case
    when delay<0 and delay ==0 then '9999'
    when delay>0 and delay<7 then '9'
    when delay>=7 and delay<=14 then '8'
    when delay>=15 and delay<=29 then '7'
    when delay>=30 and delay<=59 then '6'
    when delay>=60 and delay<=89 then '5'
    when delay>=90 and delay<=119 then '4'
    when delay>=120 and delay<=149 then '3'
    when delay>=150 and delay<=179 then '2'
    else '1'
    end as bucket
    from mytable""")
    bucket_df_1 = bucket_df_1.select(bucket_df_1.Id,bucket_df_1.Principal,bucket_df_1.delay,bucket_df_1.unpaid_emi,bucket_df_1.future_payment,bucket_df_1.bucket.cast("int").alias('buckets'))
    min_bucket = bucket_df_1.groupBy('Id').min('buckets').withColumnRenamed('min(buckets)' , 'max_delay')
    joinedDf = par.join(min_bucket, ["Id"])
    #joinedDf.printSchema()

Here is the command used to submit the application:

spark-submit \
--master yarn \
--driver-class-path /path to/mysql-connector-java-5.0.8-bin.jar \
--jars /path to/mysql-connector-java-5.0.8-bin.jar \
/path to/mycode.py  
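(For reference, the loading step that appears only as the #inp_data comment in the code above would typically be a JDBC read through the MySQL connector passed here; the table name and connection details in this sketch are placeholders, not the actual values:)

inp_data = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://<host>:3306/<database>") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", "<table>") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .load()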

Error:

17/11/10 10:00:34 INFO SparkSqlParser: Parsing command: mytable
Traceback (most recent call last):
  File "/path to/mycode.py", line 36, in <module>
    from mytable""")
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 73, in __init__
AttributeError: 'str' object has no attribute '_jsc'
17/11/10 10:00:34 INFO SparkContext: Invoking stop() from shutdown hook
17/11/10 10:00:34 INFO SparkUI: Stopped Spark web UI at ........

I am very new to Spark, so could someone point out what I am doing wrong? Also, any feedback on improving the coding style would be appreciated!

Spark version: 2.2

1 Answer:

Answer 0 (score: 1)

You are using the imported SQLContext (aliased as sql, and not bound to any Spark instance) to query the temp table, instead of calling spark.sql on the SparkSession you initialized.
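To see why this raises AttributeError: 'str' object has no attribute '_jsc', here is a minimal sketch (assuming Spark 2.2, where SQLContext.__init__ expects a SparkContext and reads its _jsc attribute, as the traceback shows):

from pyspark.sql import SQLContext as sql   # `sql` is now the SQLContext class itself
sql("select * from mytable")                # constructs a new SQLContext, passing the query
                                            # string where a SparkContext is expected
# -> AttributeError: 'str' object has no attribute '_jsc'

Below is your code with spark.sql used for the query instead, and a few of your imports and initializations moved around: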

from __future__ import print_function
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession



if __name__ == "__main__":
    # move the initializations within the main
    conf = SparkConf().setAppName("myapp")
    # create the session
    spark = SparkSession.builder.config(conf=conf) \
           .getOrCreate()

    # load your data and do what you need to do
    #inp_data = loaded data from db
    df = inp_data.select('Id','DueDate','Principal','delay','unpaid_emi','future_payment')
    filterd_unpaid_emi = df.filter(df.unpaid_emi == 1)
    par = filterd_unpaid_emi.groupBy('Id').sum('Principal').withColumnRenamed('sum(Principal)' , 'par')
    temp_df = df.filter(df.unpaid_emi == 1)
    temp_df_1 = temp_df.filter(temp_df.future_payment == 0)
    temp_df_1.registerTempTable("mytable")

    # use spark.sql to query your table
    bucket_df_1 = spark.sql("""select *, case
    when delay<0 and delay ==0 then '9999'
    when delay>0 and delay<7 then '9'
    when delay>=7 and delay<=14 then '8'
    when delay>=15 and delay<=29 then '7'
    when delay>=30 and delay<=59 then '6'
    when delay>=60 and delay<=89 then '5'
    when delay>=90 and delay<=119 then '4'
    when delay>=120 and delay<=149 then '3'
    when delay>=150 and delay<=179 then '2'
    else '1'
    end as bucket
    from mytable""")

    bucket_df_1 = bucket_df_1.select(bucket_df_1.Id,bucket_df_1.Principal,bucket_df_1.delay,bucket_df_1.unpaid_emi,bucket_df_1.future_payment,bucket_df_1.bucket.cast("int").alias('buckets'))
    min_bucket = bucket_df_1.groupBy('Id').min('buckets').withColumnRenamed('min(buckets)' , 'max_delay')
    joinedDf = par.join(min_bucket, ["Id"])
    #joinedDf.printSchema()
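A couple of style notes, since you asked: registerTempTable has been deprecated since Spark 2.0 in favour of createOrReplaceTempView, and the bucketing can be expressed with the DataFrame API instead of a temp table plus SQL. A rough sketch of that alternative (same case logic as the query above):

from pyspark.sql import functions as F

temp_df_1.createOrReplaceTempView("mytable")   # preferred over registerTempTable

# NOTE: the first SQL branch (delay<0 and delay ==0) can never match;
# adjust the condition below if '9999' was meant for delay <= 0.
bucket_df_1 = temp_df_1.withColumn(
    "buckets",
    F.when((F.col("delay") > 0) & (F.col("delay") < 7), 9)
     .when((F.col("delay") >= 7) & (F.col("delay") <= 14), 8)
     .when((F.col("delay") >= 15) & (F.col("delay") <= 29), 7)
     .when((F.col("delay") >= 30) & (F.col("delay") <= 59), 6)
     .when((F.col("delay") >= 60) & (F.col("delay") <= 89), 5)
     .when((F.col("delay") >= 90) & (F.col("delay") <= 119), 4)
     .when((F.col("delay") >= 120) & (F.col("delay") <= 149), 3)
     .when((F.col("delay") >= 150) & (F.col("delay") <= 179), 2)
     .otherwise(1)
)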

Hope this helps, good luck!