Question

Is it possible to create a UDF that returns a set of columns?

That is, given a data frame like the following:

| Feature1 | Feature2 | Feature 3 |
| 1.3      | 3.4      | 4.5       |

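Now I would like to extract a new feature that can be described as a vector of two elements (for example, as seen in a linear regression: the slope and the offset). The desired dataset should look like this: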

| Feature1 | Feature2 | Feature 3 | Slope | Offset |
| 1.3      | 3.4      | 4.5       | 0.5   | 3      |

Is it possible to create multiple columns with a single UDF, or do I need to follow the rule "a single column per UDF"?

Answer 1

Struct method

You can define the udf function as

def myFunc: (String => (String, String)) = { s => (s.toLowerCase, s.toUpperCase)}

import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)

and use .* as

val newDF = df.withColumn("newCol", myUDF(df("Feature2"))).select("Feature1", "Feature2", "Feature 3", "newCol.*")

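I have tested this with a udf function that returns a Tuple2 (higher-arity tuples can be used depending on how many columns are needed). The returned tuple is treated as a struct column. You can then use .* to select all of its elements as separate columns, and finally rename them.

You should get the output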

+--------+--------+---------+---+---+
|Feature1|Feature2|Feature 3|_1 |_2 |
+--------+--------+---------+---+---+
|1.3     |3.4     |4.5      |3.4|3.4|
+--------+--------+---------+---+---+

You can then rename _1 and _2.
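
One way to do that renaming is to select the struct fields by name and alias each one. The following is only a minimal sketch: the local session, the one-row string-typed DataFrame, and the names session, renamed, Slope and Offset are assumptions made for illustration, and it is written as a script (for example for spark-shell).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session created only for this illustration.
val session = SparkSession.builder().master("local[*]").appName("struct-udf-sketch").getOrCreate()

// Assumption: a one-row, string-typed stand-in for the DataFrame in the question.
val df = session.createDataFrame(Seq(("1.3", "3.4", "4.5"))).toDF("Feature1", "Feature2", "Feature 3")

// Same tuple-returning function as above; Spark stores the tuple as a struct with fields _1 and _2.
val myUDF = udf((s: String) => (s.toLowerCase, s.toUpperCase))

// Pick each struct field by name and give it the desired column name.
val renamed = df
  .withColumn("newCol", myUDF(col("Feature2")))
  .select(
    col("Feature1"), col("Feature2"), col("Feature 3"),
    col("newCol._1").as("Slope"),
    col("newCol._2").as("Offset"))

renamed.show(false)
```

Selecting newCol._1 and newCol._2 explicitly, rather than newCol.*, lets the alias be attached in the same select.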

Array method

The udf function should return an array

def myFunc: (String => Array[String]) = { s => Array(s.toLowerCase, s.toUpperCase)}

import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)

You can select elements of the array and rename them with alias (the $"..." column syntax needs import spark.implicits._, which spark-shell provides automatically)

val newDF = df.withColumn("newCol", myUDF(df("Feature2"))).select($"Feature1", $"Feature2", $"Feature 3", $"newCol"(0).as("Slope"), $"newCol"(1).as("Offset"))

You should get

+--------+--------+---------+-----+------+
|Feature1|Feature2|Feature 3|Slope|Offset|
+--------+--------+---------+-----+------+
|1.3     |3.4     |4.5      |3.4  |3.4   |
+--------+--------+---------+-----+------+

Answer 2

In addition, you can return a case class:

case class NewFeatures(slope: Double, offset: Int)

val getNewFeatures = udf { s: String =>
  NewFeatures(???, ???)
}

df
  .withColumn("newF", getNewFeatures($"Feature1"))
  .select($"Feature1", $"Feature2", $"Feature3", $"newF.slope", $"newF.offset")
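
A self-contained sketch of the case class approach might look like the following. The arithmetic inside the UDF is a placeholder and not a real regression fit, the Double input type is an assumption, and the local session, the one-row DataFrame, and the names session and result are made up for illustration. It is written as a script (for example for spark-shell).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// The case class field names become the names of the struct fields.
case class NewFeatures(slope: Double, offset: Int)

// Assumption: a local session created only for this illustration.
val session = SparkSession.builder().master("local[*]").appName("case-class-udf-sketch").getOrCreate()

// Assumption: a one-row, numeric stand-in for the DataFrame in the question.
val df = session.createDataFrame(Seq((1.3, 3.4, 4.5))).toDF("Feature1", "Feature2", "Feature3")

// Placeholder computation (hypothetical): halve the input and truncate it to an Int.
val getNewFeatures = udf((x: Double) => NewFeatures(x / 2.0, x.toInt))

// Selecting newF.slope and newF.offset yields columns named slope and offset.
val result = df
  .withColumn("newF", getNewFeatures(col("Feature1")))
  .select(col("Feature1"), col("Feature2"), col("Feature3"), col("newF.slope"), col("newF.offset"))

result.show(false)
```

Compared with a tuple, the case class gives the struct fields meaningful names, so no renaming step is needed afterwards.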
