Question

I'm looking for an API that lets me add a column based on the output of a function that has access to the whole Row, similar to what Dataset#filter(FilterFunction) offers.

As an example, suppose I have the following DataFrame:

+----+----+----+
| c0 | c1 | c2 |
+----+----+----+
| 1  | 2  | 3  |
+----+----+----+

I would like to be able to create a new column like this:

df.withColumn("c3", row ->
  row.getInt(0) + row.getInt(1) + row.getInt(2));

and end up with:

+----+----+----+----+
| c0 | c1 | c2 | c3 |
+----+----+----+----+
| 1  | 2  | 3  | 6  |
+----+----+----+----+

This is an oversimplified example; the function in question is much more complex and is built at runtime.

Answer 1

You can use map:

map(MapFunction<T,U> func, Encoder<U> encoder)

and rebuild the whole Row, or pack all the columns you need into a struct and pass it to a udf:

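import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

// The UDF receives the struct built below as a single Row argument.
UserDefinedFunction f = udf(
  (UDF1<Row, Integer>) row -> row.getInt(0) + row.getInt(1) + row.getInt(2),
  DataTypes.IntegerType
);

// Pack the existing columns c0, c1, c2 into a struct and apply the UDF to it.
df.withColumn("c3", f.apply(struct(col("c0"), col("c1"), col("c2"))));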
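The map route mentioned at the start of this answer could look roughly like the sketch below. It assumes Spark 3.5 or later, where Encoders.row(StructType) is available (on older versions RowEncoder.apply(schema) plays the same role), and it assumes the first three columns are integers, as in the question. The class name AddColumnWithMap and the method addSumColumn are made up for illustration.

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public final class AddColumnWithMap {

    // Hypothetical helper: rebuilds every Row with one extra integer column "c3".
    static Dataset<Row> addSumColumn(Dataset<Row> df) {
        // The output schema is the input schema plus the new column.
        StructType schema = df.schema().add("c3", DataTypes.IntegerType);

        return df.map((MapFunction<Row, Row>) row -> {
            // Copy the existing values, then append the computed one.
            Object[] values = new Object[row.size() + 1];
            for (int i = 0; i < row.size(); i++) {
                values[i] = row.get(i);
            }
            // Assumption: columns 0..2 are non-null integers.
            values[row.size()] = row.getInt(0) + row.getInt(1) + row.getInt(2);
            return RowFactory.create(values);
        }, Encoders.row(schema));
    }
}
```

Because the whole Row is rebuilt, the encoder has to describe the full output schema, not just the new column.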

However, both approaches will be considerably less efficient than using standard SQL expressions.
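If the runtime-built function can be expressed with column operations, a Column expression can be assembled at runtime instead. The following is only a sketch for the summation example from the question; the class name AddColumnWithExpression and the helper sumOf are hypothetical, and whether the real function can be written this way depends on what it does.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class AddColumnWithExpression {

    // Hypothetical helper: folds the named columns into one "a + b + c" expression.
    static Column sumOf(String... names) {
        Column total = lit(0);
        for (String name : names) {
            total = total.plus(col(name));
        }
        return total;
    }

    // Adds "c3" as the sum of all existing columns, without a UDF or map.
    static Dataset<Row> addSumColumn(Dataset<Row> df) {
        return df.withColumn("c3", sumOf(df.columns()));
    }
}
```

The expression is built on the driver from whatever column names are known at runtime, and Spark can then optimize it like any other SQL expression.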
