Scala: How to group an Iterable[T] into an Iterable[T] by timestamp

Asked: 2018-06-02 15:04:48

Tags: scala group-by iterator

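I would like to write code that groups a row iterator input, Iterator[InputRow], by timestamp so that each unique item (unit, eventName) appears once, i.e. eventTime should be the latest timestamp for that item in the new Iterator[InputRow]. InputRow is defined as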

case class InputRow(unit:Int, eventName: String, eventTime:java.sql.Timestamp, value: Int)

Sample data before grouping:

+-----------------------+----+---------+-----+
|eventTime              |unit|eventName|value|
+-----------------------+----+---------+-----+
|2018-06-02 16:05:11    |2   |B        |1    |
|2018-06-02 16:05:12    |1   |A        |2    |
|2018-06-02 16:05:13    |2   |A        |2    |
|2018-06-02 16:05:14    |1   |A        |3    |
|2018-06-02 16:05:15    |2   |A        |3    |
+-----------------------+----+---------+-----+

After:

+-----------------------+----+---------+-----+
|eventTime              |unit|eventName|value|
+-----------------------+----+---------+-----+
|2018-06-02 16:05:11    |2   |B        |1    |
|2018-06-02 16:05:14    |1   |A        |3    |
|2018-06-02 16:05:15    |2   |A        |3    |
+-----------------------+----+---------+-----+

What would be a good way to write the above in Scala?

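3 Answers:

Answer 0: (score: 3)

Good news: your question already contains the verbs that correspond to the function calls used in the code: group by, sort by (latest timestamp).

To sort InputRow by timestamp, we need an implicit Ordering: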

implicit val rowSortByTimestamp: Ordering[InputRow] = 
    (r1: InputRow, r2: InputRow) => r1.eventTime.compareTo(r2.eventTime)
// or shorter:
// implicit val rowSortByTimestamp: Ordering[InputRow] = 
//   _.eventTime compareTo _.eventTime

Now, given

val input: Iterator[InputRow] = ??? // input data

let's group the rows by (unit, eventName)

val result = input.toSeq.groupBy(row => (row.unit, row.eventName))

then extract the one with the latest timestamp from each group

  .map { case (gr, rows) => rows.sorted.last }

and sort the result from earliest to latest

  .toSeq.sorted

The result is

InputRow(2,B,2018-06-02 16:05:11.0,1)
InputRow(1,A,2018-06-02 16:05:14.0,3)
InputRow(2,A,2018-06-02 16:05:15.0,3)
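
For reference, here is a minimal self-contained sketch that puts the steps above together and runs them on the sample data from the question. The object name LatestPerKey, the helper ts and the method latestPerKey are made up for illustration; the Ordering is spelled out as an anonymous class only so that it also compiles on older Scala versions, and rows.sorted.last is shortened to max.

```scala
import java.sql.Timestamp

object LatestPerKey {
  case class InputRow(unit: Int, eventName: String, eventTime: Timestamp, value: Int)

  // Hypothetical helper: parses "yyyy-MM-dd HH:mm:ss" into a Timestamp.
  private def ts(s: String): Timestamp = Timestamp.valueOf(s)

  // Sample rows copied from the question.
  val sample: Seq[InputRow] = Seq(
    InputRow(2, "B", ts("2018-06-02 16:05:11"), 1),
    InputRow(1, "A", ts("2018-06-02 16:05:12"), 2),
    InputRow(2, "A", ts("2018-06-02 16:05:13"), 2),
    InputRow(1, "A", ts("2018-06-02 16:05:14"), 3),
    InputRow(2, "A", ts("2018-06-02 16:05:15"), 3)
  )

  // Rows are ordered by eventTime only.
  implicit val rowSortByTimestamp: Ordering[InputRow] = new Ordering[InputRow] {
    def compare(r1: InputRow, r2: InputRow): Int = r1.eventTime.compareTo(r2.eventTime)
  }

  def latestPerKey(input: Iterator[InputRow]): Iterator[InputRow] =
    input.toSeq
      .groupBy(row => (row.unit, row.eventName)) // one group per (unit, eventName)
      .values
      .map(_.max)                                // latest row of each group
      .toSeq
      .sorted                                    // earliest to latest
      .iterator
}
```

Calling LatestPerKey.latestPerKey(LatestPerKey.sample.iterator) should yield the three rows listed above. Note that toSeq materializes the whole iterator, so this only suits inputs that fit in memory.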

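Answer 1: (score: 1)

You can use the built-in struct function to combine the eventTime and value columns into a single struct column, so that max picks the entry with the latest eventTime when you group by unit and eventName and aggregate. Assuming the data is in a Spark DataFrame df, this should give you the desired output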

import org.apache.spark.sql.functions._
df.withColumn("struct", struct("eventTime", "value"))
    .groupBy("unit", "eventName")
    .agg(max("struct").as("struct"))
    .select(col("struct.eventTime"), col("unit"), col("eventName"), col("struct.value"))

which gives

+-------------------+----+---------+-----+
|eventTime          |unit|eventName|value|
+-------------------+----+---------+-----+
|2018-06-02 16:05:14|1   |A        |3    |
|2018-06-02 16:05:11|2   |B        |1    |
|2018-06-02 16:05:15|2   |A        |3    |
+-------------------+----+---------+-----+
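
The reason max over the struct returns the latest row is that structs are compared field by field, from left to right, so eventTime has to be the first field. As a rough plain-Scala analogy (not Spark code), tuples compare the same way. The object name StructMaxAnalogy and the values below are made up, chosen so that the later row has the smaller value:

```scala
object StructMaxAnalogy {
  // Hypothetical rows of one (unit, eventName) group, as (eventTime, value).
  // Timestamps are kept as strings; this fixed-width format sorts chronologically.
  val group: Seq[(String, Int)] = Seq(
    ("2018-06-02 16:05:12", 9),
    ("2018-06-02 16:05:14", 3)
  )

  // Tuples compare on the first element, then on the second for ties,
  // so the latest eventTime wins regardless of value.
  val timeFirst: (String, Int) = group.max

  // With the fields swapped, value decides instead and the latest row is lost.
  val valueFirst: (Int, String) = group.map(_.swap).max
}
```

Here timeFirst is the 16:05:14 entry, while valueFirst is the 16:05:12 entry, which is why the column order inside struct matters.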

Answer 2: (score: 0)

You can use foldLeft and map to achieve this:

val grouped: Map[(Int, String), InputRow] = 
  rows
    .foldLeft(Map.empty[(Int, String), Seq[InputRow]])({ case (acc, row) =>
     val key = (row.unit, row.eventName)
     // Get from the accumulator the Seq that already exists or Nil if
     // this key has never been seen before
     val value = acc.getOrElse(key, Nil)
     // Update the accumulator
     acc + (key -> (value :+ row))
  })
  // Get the last element from the list of rows when grouped by unit and event.
  .map({case (k, v) => k -> v.last})

This assumes the rows already arrive sorted by eventTime. If that is not a safe assumption, you can define an implicit Ordering for java.sql.Timestamp and replace v.last with v.maxBy(_.eventTime).
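
If the input is not sorted, another option is to keep only the latest row per key inside the fold, which also avoids building a Seq for every key. This is just a sketch of that variation, not part of the original answer; the names LatestByFold and latestByKey are made up for illustration.

```scala
import java.sql.Timestamp

object LatestByFold {
  case class InputRow(unit: Int, eventName: String, eventTime: Timestamp, value: Int)

  // Single pass over the rows; the accumulator holds one row per (unit, eventName).
  def latestByKey(rows: Iterator[InputRow]): Map[(Int, String), InputRow] =
    rows.foldLeft(Map.empty[(Int, String), InputRow]) { (acc, row) =>
      val key = (row.unit, row.eventName)
      acc.get(key) match {
        // Keep the stored row when it is strictly later than the incoming one.
        case Some(seen) if seen.eventTime.after(row.eventTime) => acc
        // New key, or the incoming row is at least as late: store the incoming row.
        case _ => acc + (key -> row)
      }
    }
}
```

The result is a Map keyed by (unit, eventName); call .values on it to get the rows back as an Iterable[InputRow].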

Edit

Or use .groupBy(row => (row.unit, row.eventName)) instead of foldLeft:

import java.sql.Timestamp

implicit val ordering: Ordering[Timestamp] = _ compareTo _
val grouped = rows.groupBy(row => (row.unit, row.eventName))
                  .values
                  .map(_.maxBy(_.eventTime))