Scala - How do you GroupBy and Sum on a case class?

Asked: 2017-07-21 00:18:20

Tags: scala apache-spark dataframe group-by

I am very new to this. My question: for the case class

case class testclass(date_key: String, amount: Int, type: String, condition1: String, condition2: String)

in a DataFrame df, I am trying to sum the amount column, grouped by type, for the rows where condition1 = condition2.

I am trying to define a function for this, but how should I do it? Many thanks!

def sumAmount (t: testclass): Int = {
  if (condition1 == condition2) {

  } else {
    "na"
  }
}
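
As a side note, the attempt above cannot compile as written: the two branches mix Int and String results, and condition1/condition2 are not in scope. A minimal, hypothetical per-row helper (not from the original question, assuming the testclass case class above) could return Option[Int] instead:

// Hypothetical per-row helper: yields the amount only when the two
// condition columns match, and None otherwise, avoiding the Int/String mix.
def sumAmount(t: testclass): Option[Int] =
  if (t.condition1 == t.condition2) Some(t.amount) else None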

2 Answers:

Answer 0 (score: 2):

I am assuming you already have a dataframe created from the case class. For testing purposes, I created a test case class testclass(date_key: String, amount: Int, types: String, condition1: String, condition2: String) and a dataframe:

import sqlContext.implicits._
val df = Seq(
  testclass("2015-01-01", 332, "types", "condition1", "condition1"),
  testclass("2015-01-01", 332, "types", "condition1", "condition1"),
  testclass("2015-01-01", 332, "types", "condition1", "condition2"),
  testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
  testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
  testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
  testclass("2015-01-01", 332, "types2", "condition1", "condition2")
).toDF

which should give you

+----------+------+------+----------+----------+
|date_key  |amount|types |condition1|condition2|
+----------+------+------+----------+----------+
|2015-01-01|332   |types |condition1|condition1|
|2015-01-01|332   |types |condition1|condition1|
|2015-01-01|332   |types |condition1|condition2|
|2015-01-01|332   |types2|condition1|condition1|
|2015-01-01|332   |types2|condition1|condition1|
|2015-01-01|332   |types2|condition1|condition1|
|2015-01-01|332   |types2|condition1|condition2|
+----------+------+------+----------+----------+

Now you want to groupBy the types column and sum the amount. For that, filter the rows where condition1 = condition2, then groupBy types and apply the sum aggregation:

import org.apache.spark.sql.functions.sum

df.filter($"condition1" === $"condition2")
  .groupBy("types")
  .agg(sum("amount").as("sum"))
  .show(false)

You should get the desired result:

+------+---+
|types |sum|
+------+---+
|types |664|
|types2|996|
+------+---+

Update:

If you want to use a Dataset instead of a dataframe, you can use .toDS instead of .toDF:

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> case class testclass(date_key: String , amount: Int, types: String, condition1: String, condition2: String)
defined class testclass

scala> val ds = Seq(
     | testclass("2015-01-01", 332, "types", "condition1", "condition1"),
     |       testclass("2015-01-01", 332, "types", "condition1", "condition1"),
     |       testclass("2015-01-01", 332, "types", "condition1", "condition2"),
     |       testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
     |       testclass("2015-01-01", 332, "types2", "condition1", "condition2")
     |     ).toDS
ds: org.apache.spark.sql.Dataset[testclass] = [date_key: string, amount: int ... 3 more fields]

You can see that it is a Dataset and not a dataframe.

The rest of the steps are the same as described above.
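
As a rough sketch (assuming the same testclass and ds as in the transcript above), the same result can also be computed with the typed Dataset API, using groupByKey instead of the untyped groupBy/agg:

import sqlContext.implicits._

// Typed alternative: filter with a plain Scala predicate, group by the
// types field, and sum the amounts within each group.
val sums = ds
  .filter(t => t.condition1 == t.condition2)
  .groupByKey(_.types)
  .mapGroups { case (types, rows) => (types, rows.map(_.amount).sum) }
  .toDF("types", "sum")

sums.show(false)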

Answer 1 (score: 0):

  • First, filter the collection to keep only the elements where data.condition1.equals(data.condition2)
  • Then groupBy the data type, which gives you dataType as the key and a list of case class instances as the value
  • Finally, sum the amount over each list of values

Example (without Spark):

case class MyData(dataKey: String, amount: Int, dataType: String, condition1: String, condition2: String)

val grouped = List(MyData("a", 1000, "type1", "matches1", "matches1"),
  MyData("b", 1000, "type1", "matches1", "matches1"),
  MyData("c", 1000, "type1", "matches1", "matches2"),
  MyData("d", 1000, "type2", "matches1", "matches1")
).filter(data => data.condition1.equals(data.condition2))
  .groupBy(_.dataType)
  .map{ case (dataType, values) =>
    dataType -> values.map(_.amount).sum
  }.toMap

grouped("type1") shouldBe 2000
grouped("type2") shouldBe 1000