Question

spark.createDataFrame creates an empty DF

Answer 1

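Let's first create a minimal, reproducible instance of the problem. By the way, this is something you should try to do every time you ask a question ;-)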

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// An RDD of strings
val rdd = sc.parallelize(Seq("oli,15,56,0.5", "you,45,49987787,0.4"))

// your schema
val schema = new StructType()
    .add("displayname", StringType, true)
    .add("reputation", IntegerType, true)
    .add("numberOfPosts", LongType, true)
    .add("score", DoubleType, true)

// Now, let's try to create a dataframe
val rddOfRows = rdd.map(_.split(",")).map(Row.fromSeq(_))
val df = spark.createDataFrame(rddOfRows, schema)
// we can print its schema
df.printSchema
root
 |-- displayname: string (nullable = true)
 |-- reputation: integer (nullable = true)
 |-- numberOfPosts: long (nullable = true)
 |-- score: double (nullable = true)

// but show triggers the exception you mentioned
df.show
  java.lang.RuntimeException: java.lang.String is not a valid external type for
  schema of int

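Why? You need to remember that Spark is lazy. As long as you do not collect or write data, Spark does not execute anything. When you call createDataFrame, nothing happens, which is why you get no error at that point. When you print the schema, Spark simply prints the schema you provided. But when I call show, I ask Spark to actually do something, and that triggers all the related computation.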
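
The same delayed failure can be reproduced without any DataFrame at all. The following is only a minimal sketch, assuming a local session (in spark-shell, `getOrCreate()` simply returns the existing `spark`); the sample values and the names `parsed` and `outcome` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Assumption: a local session; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()
val sc = spark.sparkContext

// Transformation only: Spark records the map but runs nothing,
// so the unparsable value goes unnoticed on this line.
val parsed = sc.parallelize(Seq("15", "45", "not-a-number")).map(_.toInt)

// Action: collect() runs the job, and only now does toInt fail.
// Try captures the exception instead of letting it abort the session.
val outcome = Try(parsed.collect())
println(s"job failed: ${outcome.isFailure}")
```

The line that defines `parsed` completes silently; the error only surfaces once an action such as `collect()` forces the computation, just like `show` in the example above.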

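The problem you are seeing is that Spark expects an int but you provide a string. Spark does not cast the data when it creates a DataFrame. There are several ways to solve this. One solution is to convert the fields beforehand, like this: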

// Convert each field to the type declared in the schema before building the Row
val rddOfRows = rdd
  .map(_.split(","))
  .map { case Array(a, b, c, d) => (a, b.toInt, c.toLong, d.toDouble) }
  .map(Row.fromTuple(_))
// and the rest of the code remains unchanged
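
Another possibility, sketched here as an assumption rather than as part of the answer above, is to declare every field as a string (which is what `split(",")` really produces) and let Spark cast the columns afterwards. The names `stringSchema`, `rawDf` and `typedDf` are hypothetical.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, IntegerType, LongType, StringType, StructType}

// Assumption: a local session; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("cast-demo").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Seq("oli,15,56,0.5", "you,45,49987787,0.4"))

// Describe every field as a string so the schema matches the actual row contents.
val stringSchema = new StructType()
  .add("displayname", StringType, true)
  .add("reputation", StringType, true)
  .add("numberOfPosts", StringType, true)
  .add("score", StringType, true)

val rawDf = spark.createDataFrame(rdd.map(_.split(",")).map(Row.fromSeq(_)), stringSchema)

// Cast column by column; withColumn with an existing name replaces that column.
val typedDf = rawDf
  .withColumn("reputation", col("reputation").cast(IntegerType))
  .withColumn("numberOfPosts", col("numberOfPosts").cast(LongType))
  .withColumn("score", col("score").cast(DoubleType))

typedDf.printSchema()
typedDf.show()
```

With this approach the conversion is done by Spark rather than by `toInt` and friends. Depending on the ANSI setting of the session, a value that cannot be parsed either becomes null or raises an error when an action runs.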
