Question

Suppose I generate a Spark DataFrame:

val df = Seq(
    (Array(1, 2, 3), Array("a", "b", "c")),
    (Array(1, 2, 3), Array("a", "b", "c"))
).toDF("Col1", "Col2")

I can extract the element at index 1 of "Col1" with, for example:

val extractFirstInt = udf { (x: Seq[Int], i: Int) => x(i) }
df.withColumn("Col1_1", extractFirstInt($"Col1", lit(1)))

And similarly for the second column, "Col2":

val extractFirstString = udf { (x: Seq[String], i: Int) => x(i) }
df.withColumn("Col2_1", extractFirstString($"Col2", lit(1)))

But the code duplication is a bit ugly: I need a separate UDF for each underlying element type.

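Is there a way to write a generic UDF that automatically infers the type of the underlying array in a column of a Spark Dataset? For example, I'd like to be able to write something like this (pseudocode, with a generic type T):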

val extractFirst = udf { (x: Seq[T], i: Int) => x(i) }
df.withColumn("Col1_1", extractFirst($"Col1", lit(1)))

where the type T would somehow be inferred automatically by Spark or the Scala compiler (perhaps using reflection, if appropriate).

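Bonus points if you know of a solution that works both for array columns and for Spark's own DenseVector / SparseVector types. The main thing I want to avoid, if at all possible, is having to define a separate UDF for each underlying array element type I want to handle.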

Answer 1

Perhaps frameless could be a solution?

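Since manipulating Datasets requires an Encoder for the given type, you have to define the type up front so that Spark SQL can create an Encoder for you. I think a Scala macro that generates the various Encoder-supported types would make sense here.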

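As of now, I'd define a generic method and one UDF per type (which goes against your wish to find a way to have "a generic UDF that automatically infers the type of the underlying array in a column of a Spark Dataset").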

def myExtract[T](x: Seq[T], i: Int) = x(i)
// define UDF for extracting strings
val extractString = udf(myExtract[String] _)

Use it as follows:

val df = Seq(
    (Array(1, 2, 3), Array("a", "b", "c")),
    (Array(1, 2, 3), Array("a", "b", "c"))
).toDF("Col1", "Col2")

scala> df.withColumn("Col1_1", extractString($"Col2", lit(1))).show
+---------+---------+------+
|     Col1|     Col2|Col1_1|
+---------+---------+------+
|[1, 2, 3]|[a, b, c]|     b|
|[1, 2, 3]|[a, b, c]|     b|
+---------+---------+------+

You could explore Dataset (not DataFrame, i.e. Dataset[Row]) instead. That would give you all the type machinery (and perhaps you could avoid any macro development).
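
To illustrate the typed Dataset route suggested above, here is a minimal sketch. It is not a generic UDF: it moves the element type into a context bound so that the Scala compiler picks the Encoder. The object TypedExtract and the helper extractAt are hypothetical names introduced for this example, and the sketch assumes Spark 2.2 or later running in local mode.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

object TypedExtract {
  // Hypothetical helper (not part of Spark): returns element i of every row.
  // T is inferred at the call site; the Encoder[T] comes from spark.implicits._
  def extractAt[T: Encoder](ds: Dataset[Seq[T]], i: Int): Dataset[T] =
    ds.map(xs => xs(i))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // assumption: local run for illustration
      .appName("typed-extract")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (Array(1, 2, 3), Array("a", "b", "c")),
      (Array(1, 2, 3), Array("a", "b", "c"))
    ).toDF("Col1", "Col2")

    // One generic method serves both element types.
    val ints    = extractAt(df.select($"Col1").as[Seq[Int]], 1)
    val strings = extractAt(df.select($"Col2").as[Seq[String]], 1)

    ints.show()
    strings.show()
    spark.stop()
  }
}
```

The trade-off is that each column has to be converted with as[Seq[...]] first, so the element type is still spelled out once per column, just not once per UDF.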

Answer 2

Following the advice from @zero323, I settled on an implementation of the following form:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

def extractFirst(df: DataFrame, column: String, into: String) = {

  // extract column of interest
  val col = df.apply(column)

  // figure out the type name for this column
  val schema = df.schema
  val typeName = schema.apply(schema.fieldIndex(column)).dataType.typeName

  // delegate based on column type
  typeName match {

    case "array"  => df.withColumn(into, col.getItem(0))
    case "vector" => {
      // construct a udf to extract first element
      // (could almost certainly do better here,
      // but this demonstrates the strategy regardless)
      val extractor = udf {
        (x: Any) => {
          val el = x.getClass.getDeclaredMethod("toArray").invoke(x)
          val array = el.asInstanceOf[Array[Double]]
          array(0)
        }
      }

      df.withColumn(into, extractor(col))
    }

    case _ => throw new IllegalArgumentException("unexpected type '" + typeName + "'")
  }
}
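
As a possible alternative to the reflection-based UDF, the sketch below uses only built-in column functions. It assumes Spark 3.0 or later, where org.apache.spark.ml.functions.vector_to_array is available; this is a swapped-in technique, not what the answer above does. The object BuiltinExtract, the helper elementAt, and the column names are hypothetical and exist only for this example.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object BuiltinExtract {
  // Hypothetical helper: element i of an array column or of an ML vector column.
  // Column.getItem works for arrays of any element type, so no UDF is needed.
  def elementAt(df: DataFrame, column: String, into: String, i: Int): DataFrame =
    df.schema(column).dataType.typeName match {
      case "array"  => df.withColumn(into, col(column).getItem(i))
      // Assumption: Spark 3.0+, where vector_to_array exists.
      case "vector" => df.withColumn(into, vector_to_array(col(column)).getItem(i))
      case other    => throw new IllegalArgumentException(s"unexpected type '$other'")
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // assumption: local run for illustration
      .appName("builtin-extract")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (Array("a", "b", "c"), Vectors.dense(1.0, 2.0, 3.0)),
      (Array("d", "e", "f"), Vectors.sparse(3, Array(0), Array(4.0)))
    ).toDF("letters", "features")

    val out = elementAt(elementAt(df, "letters", "letter_0", 0), "features", "feature_0", 0)
    out.show()
    spark.stop()
  }
}
```

Because vector_to_array accepts both dense and sparse vectors, the same branch covers DenseVector and SparseVector columns.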
