Question

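I have a DataFrame with several columns and a list of column names.

I want to process the DataFrame by grouping it on the columns in that list.

Here is an example of what I am trying to do: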

val tagList = List("col1","col3","col5")

var tagsForGroupBy = tagList(0)

if(tagList.length>1){
     for(i <- 1 to tagList.length-1){
          tagsForGroupBy = tagsForGroupBy+","+tagList(i)
     }
}

// df is a Dataframe with schema (col0, col1, col2, col3, col4, col5)
df.groupBy("col0",tagsForGroupBy)

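I understand why this does not work, but I do not know how to make it work.

What is the best way to do this?

EDIT:

Here is a more complete example of what I am doing (including SCouto's solution):

My tagList contains some column names ("col3", "col5"). I also want "col0" and "col1" in my groupBy regardless of what is in the list. After the groupBy and the aggregation, I want to select all the columns used for grouping plus the new columns produced by the aggregation.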

import scala.collection.mutable.ListBuffer

val tagList = List("col3","col5")

val tmpListForGroup = new ListBuffer[String]()
val tmpListForSelect = new ListBuffer[String]()
tmpListForGroup += tagList(0)
tmpListForSelect += tagList(0)

for(i <- 1 to tagList.length-1){
    tmpListForGroup += tagList(i)
    tmpListForSelect += tagList(i)
}

tmpListForGroup +="col0"
tmpListForGroup +="col1"
tmpListForSelect +="aggValue1"
tmpListForSelect +="aggValue2"

// df is a Dataframe with schema (col0, col1, col2, col3, col4, col5)
df.groupBy(tmpListForGroup.head, tmpListForGroup.tail: _*)
  .agg(
      [aggFunction].as("aggValue1"),
      [aggFunction].as("aggValue2")
  )
  .select(tmpListForSelect.head, tmpListForSelect.tail: _*)

This code does exactly what I want, but it looks very ugly and complicated for something that (I think) should be simple.

Is there another solution?

Answer 1

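When column names are passed as strings, groupBy takes one column name as its first parameter and a variable-length sequence of further names as its second:

def groupBy(col1: String, cols: String*)

So you need to pass two arguments, expanding the remaining names into the second one with ": _*". A single comma-joined string is treated as one column name, which is why your first attempt fails.

This works for your case:

df.groupBy(tagList.head, tagList.tail: _*)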
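
To see the varargs mechanics in isolation, here is a minimal sketch that does not need Spark. The function describeGroupBy is a hypothetical stand-in that only copies the parameter shape of groupBy(col1: String, cols: String*); it is not part of any library.

```scala
object VarargsSketch {
  // Hypothetical stand-in: same parameter shape as groupBy(col1: String, cols: String*).
  // It just renders the arguments it received so the difference is visible.
  def describeGroupBy(col1: String, cols: String*): String =
    (col1 +: cols).mkString("groupBy(", ", ", ")")

  def main(args: Array[String]): Unit = {
    val tagList = List("col1", "col3", "col5")

    // A comma-joined string arrives as ONE argument, i.e. one (nonexistent) column name.
    println(describeGroupBy("col0", tagList.mkString(",")))
    // prints: groupBy(col0, col1,col3,col5)

    // `: _*` expands the list into separate varargs arguments.
    println(describeGroupBy("col0", tagList: _*))
    // prints: groupBy(col0, col1, col3, col5)

    // When every name comes from the list, split it into head and tail.
    println(describeGroupBy(tagList.head, tagList.tail: _*))
    // prints: groupBy(col1, col3, col5)
  }
}
```

The head/tail split is only needed because the first parameter is a plain String rather than part of the varargs.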

Or, if you want to keep col0 separate from the list, as in your example:

df.groupBy("col0", tagList: _*)
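
For the fuller case in the edit (always group by col0 and col1, plus whatever is in the list), one possible shortening is to build a single list and use the Column-based overload groupBy(cols: Column*), which avoids the head/tail split. The sketch below is only an illustration: the object name, the helper aggregateByTags, the sample rows, and the choice of sum and max as aggregates are all assumptions, since the question leaves the aggregate functions unspecified. With Spark's default settings, agg already returns the grouping columns followed by the aggregated columns, so the trailing select may be unnecessary.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, sum}

object GroupByColumnList {
  // Hypothetical helper: groups by the fixed columns plus the tag columns.
  def aggregateByTags(df: DataFrame, tagList: List[String]): DataFrame = {
    val groupCols = List("col0", "col1") ++ tagList
    df.groupBy(groupCols.map(col): _*)   // uses the groupBy(cols: Column*) overload
      .agg(
        sum("col2").as("aggValue1"),     // assumed aggregate, not from the question
        max("col4").as("aggValue2")      // assumed aggregate, not from the question
      )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupByColumnList")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows matching the schema (col0, ..., col5) from the question.
    val df = Seq(
      (1, "a", 10, "x", 1.0, "p"),
      (1, "a", 20, "x", 3.0, "p"),
      (2, "b", 5, "y", 2.0, "q")
    ).toDF("col0", "col1", "col2", "col3", "col4", "col5")

    aggregateByTags(df, List("col3", "col5")).show()
    spark.stop()
  }
}
```

If the output columns need a different order, a final select(someList.map(col): _*) can follow the agg call in the same style.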
