Spark - Sum of row values

Time: 2016-04-01 15:47:14

Tags: scala apache-spark

I have the following DataFrame:

January | February | March
-----------------------------
  10    |    10    |  10
  20    |    20    |  20
  50    |    50    |  50

I am trying to add a column to it that holds the sum of the values in each row.

January | February | March  | TOTAL
----------------------------------
  10    |    10    |   10   |  30
  20    |    20    |   20   |  60
  50    |    50    |   50   |  150

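As far as I can tell, all the built-in aggregate functions seem to operate on the values within a single column. How can I use values across columns on a per-row basis (using Scala)?

I have gotten as far as: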

val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...

5 answers:

Answer 0 (score: 14)

You were very close with this:

val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...

Instead, try this:

import org.apache.spark.sql.functions.col
val newDf = df.select(colsToSum.map(col).reduce((c1, c2) => c1 + c2) as "sum")

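I think this is the best of the answers, because it is as fast as the answer with the hard-coded SQL query and as convenient as the one that uses a UDF. It is the best of both worlds, and I did not even add a full line of code! Note that select returns only the new "sum" column; to keep the original columns as well, pass the same reduced expression to withColumn.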
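To illustrate why the reduce approach works, here is a minimal pure-Python sketch (not Spark code) of how folding a list of columns with + produces one combined expression. The Col class and the col helper are hypothetical stand-ins for Spark's Column and functions.col, written only for this illustration.

```python
from functools import reduce

class Col:
    """Hypothetical stand-in for a Spark Column: expression text plus a row evaluator."""
    def __init__(self, text, evaluate):
        self.text = text
        self.evaluate = evaluate

    def __add__(self, other):
        # Adding two columns builds a larger expression; nothing is computed yet.
        return Col(f"({self.text} + {other.text})",
                   lambda row: self.evaluate(row) + other.evaluate(row))

def col(name):
    # Mimics functions.col: refer to a column by its name.
    return Col(name, lambda row: row[name])

cols_to_sum = ["January", "February", "March"]
# Same fold as in the answer: (c1, c2) => c1 + c2 over all columns.
total = reduce(lambda c1, c2: c1 + c2, map(col, cols_to_sum))

rows = [
    {"January": 10, "February": 10, "March": 10},
    {"January": 20, "February": 20, "March": 20},
    {"January": 50, "February": 50, "March": 50},
]
totals = [total.evaluate(row) for row in rows]
print(total.text)
print(totals)
```

The fold yields a single nested expression over the three columns, which is evaluated once per row. In Spark the same shape is handed to the query engine as an ordinary column expression, so no UDF is involved.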

Answer 1 (score: 9)

Alternatively, using Hugo's approach and example, you can create a UDF that accepts any number of columns and sums them all.

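from functools import reduce

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def superSum(*cols):
    # Fold every column value in the row into a single total.
    return reduce(lambda a, b: a + b, cols)

# Declare the return type; without it, udf defaults to StringType.
add = udf(superSum, LongType())

df.withColumn('total', add(*[df[x] for x in df.columns])).show()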


+-------+--------+-----+-----+
|January|February|March|total|
+-------+--------+-----+-----+
|     10|      10|   10|   30|
|     20|      20|   20|   60|
+-------+--------+-----+-----+
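Because superSum is an ordinary Python function, it can be checked on plain values before it is wrapped in udf. The sketch below also shows one thing to watch for: Spark passes null cells to a Python UDF as None, and adding None to a number raises a TypeError. The super_sum_skip_none variant is a hypothetical helper, not part of the answer, and treating missing values as 0 is an assumption that may not suit every dataset.

```python
from functools import reduce

def superSum(*cols):
    # Same function as in the answer: fold all arguments with +.
    return reduce(lambda a, b: a + b, cols)

def super_sum_skip_none(*cols):
    # Hypothetical variant: ignore missing values (None) instead of failing.
    return sum(c for c in cols if c is not None)

plain = superSum(10, 10, 10)

try:
    superSum(10, None, 10)
    failed = False
except TypeError:
    # int + None is not defined in Python.
    failed = True

safe = super_sum_skip_none(10, None, 10)
print(plain, failed, safe)
```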

Answer 2 (score: 8)

This code is in Python, but it can easily be translated:

# First we create an RDD in order to create a DataFrame:
rdd = sc.parallelize([(10, 10, 10), (20, 20, 20)])
df = rdd.toDF(['January', 'February', 'March'])
df.show()

# Here, we create a new column called 'TOTAL' which has results
# from add operation of columns df.January, df.February and df.March

df.withColumn('TOTAL', df.January + df.February + df.March).show()

Output:

+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
|     10|      10|   10|
|     20|      20|   20|
+-------+--------+-----+

+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
|     10|      10|   10|   30|
|     20|      20|   20|   60|
+-------+--------+-----+-----+

You can also create a user-defined function for this; see the Spark documentation: UserDefinedFunction (udf)

Answer 3 (score: 5)

A Scala example using dynamic column selection:

import sqlContext.implicits._
import org.apache.spark.sql.functions.col
val rdd = sc.parallelize(Seq((10, 10, 10), (20, 20, 20)))
val df = rdd.toDF("January", "February", "March")
df.show()

+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
|     10|      10|   10|
|     20|      20|   20|
+-------+--------+-----+

val sumDF = df.withColumn("TOTAL", df.columns.map(c => col(c)).reduce((c1, c2) => c1 + c2))
sumDF.show()

+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
|     10|      10|   10|   30|
|     20|      20|   20|   60|
+-------+--------+-----+-----+

Answer 4 (score: 4)

You can use expr(). In Scala:

import org.apache.spark.sql.functions.expr
df.withColumn("TOTAL", expr("January+February+March"))
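The expression string does not have to be hard-coded. As a minimal sketch, assuming the column names are available as a list, it can be assembled with a join. The build_sum_expr helper is hypothetical and written in plain Python; names that themselves contain a backtick are not handled here.

```python
def build_sum_expr(columns):
    # Backticks quote each name so that names with spaces or dots still parse.
    return " + ".join(f"`{name}`" for name in columns)

sum_expr = build_sum_expr(["January", "February", "March"])
print(sum_expr)
```

The resulting string could then be passed to expr() in place of the literal used in the answer.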