pyspark error caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

Asked: 2017-04-14 14:41:08

Tags: python join dataframe pyspark

I have the following pyspark code to join two DataFrames. Everything looks straightforward, but instead of the expected output I get the error below. I'm stuck; could you help me find the root cause?

Input

C.csv

100,2015-09-03,SG,7
200,2016-01-30,AT,9
300,2016-01-25,AU,8
400,2016-01-22,AU,7

U.csv

248,248,COUNTRY,SG,Singapore
66,66,COUNTRY,AT,Austria
65,65,COUNTRY,AU,Australia

Expected output

100,Singapore
200,Austria
300,Australia
400,Australia

Source

The pyspark code (test.py):

from pyspark import SparkConf, SparkContext
from pyspark.sql.types import StringType
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("HYBRID - READ CSV to HIVE ")
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)
C_rdd = sc.textFile("./hybrid/C.csv").map(lambda line: line.split(","))
R_rdd = sc.textFile("./hybrid/U.csv").map(lambda line: line.encode("ascii", "ignore").split(","))

C_df = C_rdd.toDF(['C_No','Op_Dt','Try_Cd','Lb'])
R_df = R_rdd.toDF(['C_Id','P_Id','CC_Cd','C_Nm','C_Ds'])

New = C_df.join(R_df, C_df.Try_Cd == R_df.C_Nm).select(['C_No','C_Ds'])
New.show()

Result

Pyspark error from $ spark-submit test.py:
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 5 fields are required while 6 values are provided.
        at org.apache.spark.sql.execution.EvaluatePython$.fromJava(python.scala:225)
        at org.apache.spark.sql.SQLContext$$anonfun$11.apply(SQLContext.scala:933)
        at org.apache.spark.sql.SQLContext$$anonfun$11.apply(SQLContext.scala:933)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
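The exception message is the key: the schema declares 5 fields, but at least one line of U.csv splits into 6 values, typically because of a trailing comma or an extra delimiter inside a value. A quick plain-Python check (the sample lines here are illustrative, not taken from the failing file) locates such rows:

```python
# Count comma-separated fields per line; the U.csv schema expects 5.
# A trailing comma (or an embedded delimiter) yields a 6th, empty field.
def field_counts(lines):
    return [len(line.rstrip("\n").split(",")) for line in lines]

sample = [
    "65,65,COUNTRY,AU,Australia",    # clean row: 5 fields
    "65,65,COUNTRY,AU,Australia,",   # hypothetical bad row: 6 fields
]

for line, n in zip(sample, field_counts(sample)):
    if n != 5:
        print("bad row (%d fields): %r" % (n, line))
```

The same `len(line.split(","))` test could be used inside an RDD `filter` to drop malformed rows before calling `toDF`.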

Can you help me resolve this issue?

1 Answer:

Answer 0 (score: 0)

Assuming you are on Spark 2.x+, try this -

from pyspark.sql.types import StructType,StringType,IntegerType,StructField
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("HYBRID - READ CSV to HIVE ") \
    .getOrCreate()

cSchema = StructType([StructField("C_No", IntegerType()),
                     StructField("Op_Dt", StringType()),
                     StructField("Try_Cd", StringType()),
                     StructField("Lb", IntegerType())])

uSchema = StructType([StructField("C_Id", IntegerType()),
                     StructField("P_Id", IntegerType()),
                     StructField("CC_Cd", StringType()),
                     StructField("C_Nm", StringType()),
                     StructField("C_Ds", StringType())])

c_df  = spark.read.csv("c.csv",schema=cSchema)
u_df  = spark.read.csv("u.csv",schema=uSchema)

New = c_df.join(u_df, c_df.Try_Cd == u_df.C_Nm).select(c_df.C_No,u_df.C_Ds)
New.show()
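As a sanity check, the join in this answer is just an equality lookup from C's Try_Cd to U's C_Nm; a plain-Python sketch over the sample rows from the question reproduces the expected output:

```python
# Plain-Python equivalent of the equi-join c_df.Try_Cd == u_df.C_Nm,
# using the sample rows from the question.
c_rows = [
    ("100", "2015-09-03", "SG", "7"),
    ("200", "2016-01-30", "AT", "9"),
    ("300", "2016-01-25", "AU", "8"),
    ("400", "2016-01-22", "AU", "7"),
]
u_rows = [
    ("248", "248", "COUNTRY", "SG", "Singapore"),
    ("66", "66", "COUNTRY", "AT", "Austria"),
    ("65", "65", "COUNTRY", "AU", "Australia"),
]

# Lookup from country code (C_Nm) to country name (C_Ds).
code_to_name = {r[3]: r[4] for r in u_rows}

# Select C_No and the matching C_Ds, as in the .select(...) above.
result = [(c[0], code_to_name[c[2]]) for c in c_rows]
print(result)
```

This prints the four (C_No, C_Ds) pairs listed under "Expected output".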