Cannot get Spark Aggregators to work

Asked: 2018-03-23 00:53:32

Tags: scala apache-spark apache-spark-sql aggregate-functions user-defined-functions

I want to try out Aggregators in Scala Spark, but I cannot seem to get them to work with both the select function and the groupBy/agg functions (with my current implementation the agg call fails to compile). My Aggregator is written below and should be self-explanatory.

Below is my test code.

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

/** Stores the number of true counts (tc) and false counts (fc) */
case class Counts(var tc: Long, var fc: Long)

/** Counts the number of true and false occurrences of a predicate */
class BooleanCounter[A](f: A => Boolean) extends Aggregator[A, Counts, Counts] with Serializable {
  // Initialize both counts to zero
  def zero: Counts = Counts(0L, 0L) 
  // Sum counts for intermediate value and new value
  def reduce(acc: Counts, other: A): Counts = { 
    if (f(other)) acc.tc += 1 else acc.fc += 1
    acc 
  }
  // Sum counts for intermediate values
  def merge(acc1: Counts, acc2: Counts): Counts = { 
    acc1.tc += acc2.tc
    acc1.fc += acc2.fc
    acc1
  }
  // Return results
  def finish(acc: Counts): Counts = acc 
  // Encoder for intermediate value type
  def bufferEncoder: Encoder[Counts] = Encoders.product[Counts]
  // Encoder for return type
  def outputEncoder: Encoder[Counts] = Encoders.product[Counts]
}

// Assumes import org.apache.spark.sql.Dataset and import spark.implicits._ (for toDS() and $)
case class Employee(name: String, salary: Long)  // assumed definition; salary type inferred from the sample data

val ds: Dataset[Employee] = Seq(
  Employee("John", 110),
  Employee("Paul", 100),
  Employee("George", 0),
  Employee("Ringo", 80)
).toDS()

val salaryCounter = new BooleanCounter[Employee]((r: Employee) => r.salary < 10).toColumn

// Usage works fine
ds.select(salaryCounter).show()

// Causes an error
ds.groupBy($"name").agg(salaryCounter).show()

The first use of salaryCounter works fine, but the second causes a compilation error.
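
For reference, on this sample data only George's salary (0) satisfies r.salary < 10, so the working select call should produce a single Counts row with tc = 1 and fc = 3.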

The Databricks tutorial is fairly sophisticated, but it appears to be for Spark 2.3. There is also this older tutorial that uses an experimental feature from Spark 1.6.

1 answer:

Answer 0 (score: 4)

You are incorrectly mixing the "statically typed" and "dynamically typed" APIs. To use the former, you should call agg on a KeyValueGroupedDataset, not a RelationalGroupedDataset:

ds.groupByKey(_.name).agg(salaryCounter)
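
For completeness, a minimal end-to-end sketch of the statically typed path (the types in the comments reflect my reading of the Spark 2.x API and are not part of the original answer):

ds.groupByKey(_.name)   // KeyValueGroupedDataset[String, Employee]
  .agg(salaryCounter)   // Dataset[(String, Counts)]
  .show()

Since every name in the sample data is unique, each group should report either Counts(1, 0) (for George) or Counts(0, 1) (for everyone else).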