Scala - How do you GroupBy and Sum on a case class?

Asked: 2017-07-21 00:18:20

Tags: scala apache-spark dataframe group-by

I am very new to this. My question: for the case class

case class testclass(date_key: String, amount: Int, type: String, condition1: String, condition2: String)

in a DataFrame df, I am trying to sum the amount column, grouped by type, for the rows where condition1 = condition2.

I am trying to define a function for this, but how should I do it? Many thanks!

def sumAmount (t: testclass): Int = {
  if (condition1 == condition2) {

  } else {
    "na"
  }
}
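
As a side note, the attempt above cannot compile as written: the two branches mix Int and String results, and condition1/condition2 are not in scope. A minimal, hypothetical per-row helper (not from the original question, assuming the testclass case class above) could return Option[Int] instead:

// Hypothetical per-row helper: yields the amount only when the two
// condition columns match, and None otherwise, avoiding the Int/String mix.
def sumAmount(t: testclass): Option[Int] =
  if (t.condition1 == t.condition2) Some(t.amount) else None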

2 Answers:

Answer 0 (score: 2):

I am assuming you already have a dataframe created from the case class. For testing purposes, I created a test case class testclass(date_key: String, amount: Int, types: String, condition1: String, condition2: String) and a dataframe:

import sqlContext.implicits._
val df = Seq(
  testclass("2015-01-01", 332, "types", "condition1", "condition1"),
  testclass("2015-01-01", 332, "types", "condition1", "condition1"),
  testclass("2015-01-01", 332, "types", "condition1", "condition2"),
  testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
  testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
  testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
  testclass("2015-01-01", 332, "types2", "condition1", "condition2")
).toDF

which should give you

+----------+------+------+----------+----------+
|date_key  |amount|types |condition1|condition2|
+----------+------+------+----------+----------+
|2015-01-01|332   |types |condition1|condition1|
|2015-01-01|332   |types |condition1|condition1|
|2015-01-01|332   |types |condition1|condition2|
|2015-01-01|332   |types2|condition1|condition1|
|2015-01-01|332   |types2|condition1|condition1|
|2015-01-01|332   |types2|condition1|condition1|
|2015-01-01|332   |types2|condition1|condition2|
+----------+------+------+----------+----------+

Now you want to groupBy the types column and sum the amount. For that, filter the rows where condition1 = condition2, then groupBy types and apply the sum aggregation:

import org.apache.spark.sql.functions.sum

df.filter($"condition1" === $"condition2")
  .groupBy("types")
  .agg(sum("amount").as("sum"))
  .show(false)

You should get the desired result:

+------+---+
|types |sum|
+------+---+
|types |664|
|types2|996|
+------+---+

Update:

If you want to use a Dataset instead of a dataframe, you can use .toDS instead of .toDF:

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> case class testclass(date_key: String , amount: Int, types: String, condition1: String, condition2: String)
defined class testclass

scala> val ds = Seq(
     | testclass("2015-01-01", 332, "types", "condition1", "condition1"),
     |       testclass("2015-01-01", 332, "types", "condition1", "condition1"),
     |       testclass("2015-01-01", 332, "types", "condition1", "condition2"),
     |       testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
     |       testclass("2015-01-01", 332, "types2", "condition1", "condition2")
     |     ).toDS
ds: org.apache.spark.sql.Dataset[testclass] = [date_key: string, amount: int ... 3 more fields]

You can see that it is a Dataset and not a dataframe.

The rest of the steps are the same as described above.
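
As a rough sketch (assuming the same testclass and ds as in the transcript above), the same result can also be computed with the typed Dataset API, using groupByKey instead of the untyped groupBy/agg:

import sqlContext.implicits._

// Typed alternative: filter with a plain Scala predicate, group by the
// types field, and sum the amounts within each group.
val sums = ds
  .filter(t => t.condition1 == t.condition2)
  .groupByKey(_.types)
  .mapGroups { case (types, rows) => (types, rows.map(_.amount).sum) }
  .toDF("types", "sum")

sums.show(false)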

Answer 1 (score: 0):

  • First, filter the collection to keep only the elements where data.condition1.equals(data.condition2)
  • Then groupBy the data type, which gives you dataType as the key and a list of case class instances as the value
  • Finally, sum the amount over each list of values

Example (without Spark):

case class MyData(dataKey: String, amount: Int, dataType: String, condition1: String, condition2: String)

val grouped = List(MyData("a", 1000, "type1", "matches1", "matches1"),
  MyData("b", 1000, "type1", "matches1", "matches1"),
  MyData("c", 1000, "type1", "matches1", "matches2"),
  MyData("d", 1000, "type2", "matches1", "matches1")
).filter(data => data.condition1.equals(data.condition2))
  .groupBy(_.dataType)
  .map{ case (dataType, values) =>
    dataType -> values.map(_.amount).sum
  }.toMap

grouped("type1") shouldBe 2000
grouped("type2") shouldBe 1000