How to aggregate over an array in JSON?

Date: 2017-05-03 11:38:07

Tags: scala apache-spark apache-spark-sql

I have a question about aggregating over a nested JSON array. I have a sample orders DataFrame (shown here as JSON) that looks like this:

{
  "orderId": "oi1",
  "orderLines": [
    {
      "productId": "p1",
      "quantity": 1,
      "sequence": 1,
      "totalPrice": {
        "gross": 50,
        "net": 40,
        "tax": 10
      }
    },
    {
      "productId": "p2",
      "quantity": 3,
      "sequence": 2,
      "totalPrice": {
        "gross": 300,
        "net": 240,
        "tax": 60
      }
    }
  ]
}

How can I use Spark SQL to sum the quantity across all order lines of a given order?

For example, in this case 1 + 3 = 4.

I would like to write something like the following, but there does not appear to be a built-in function that supports it (unless I have missed one!):

SELECT
  orderId,
  sum_array(orderLines.quantity) as totalQuantityItems
FROM
   orders

Would this require a custom UDF (Scala)? If so, what would it look like (any example)? The same applies even further down the nesting, e.g. summing the order's total net amount:

SELECT
  orderId,
  sum_array(orderLines.totalPrice.net) as totalOrderNet
FROM
   orders

1 Answer:

Answer 0 (score: 2):

Read the dataset with spark.read.json.

val orders = spark.
  read.
  option("wholeFile", true). // the option is named "multiLine" in Spark 2.2+
  json("orders.json").
  as[(String, Seq[(String, Long, Long, (Long, Long, Long))])] // (orderId, orderLines) as typed tuples
scala> orders.show(truncate = false)
+-------+--------------------------------------------+
|orderId|orderLines                                  |
+-------+--------------------------------------------+
|oi1    |[[p1,1,1,[50,40,10]], [p2,3,2,[300,240,60]]]|
+-------+--------------------------------------------+

scala> orders.map { case (id, lines) => (id, lines.map(_._2).sum) }.toDF("id", "sum").show
+---+---+
| id|sum|
+---+---+
|oi1|  4|
+---+---+
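
If you would rather stay close to the SQL shape from the question (and also get the nested totalPrice.net sum), a minimal sketch using explode plus a regular groupBy instead of the hypothetical sum_array; the "line" alias is just a name chosen here:

import org.apache.spark.sql.functions.{col, explode, sum}

// One row per order line, then aggregate back per order.
val perOrder = orders.
  select(col("orderId"), explode(col("orderLines")) as "line").
  groupBy("orderId").
  agg(
    sum("line.quantity") as "totalQuantityItems",
    sum("line.totalPrice.net") as "totalOrderNet")

perOrder.show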

You can also make the map-based version "prettier" using a Scala for comprehension.

val quantities = for {
  o <- orders        // (orderId, orderLines)
  id = o._1
  quantity <- o._2   // one tuple per order line
} yield (id, quantity._2)

val sumPerOrder = quantities.
  toDF("id", "quantity"). // <-- back to DataFrames to have names
  groupBy("id").
  agg(sum("quantity") as "sum") // sum is org.apache.spark.sql.functions.sum
scala> sumPerOrder.show
+---+---+
| id|sum|
+---+---+
|oi1|  4|
+---+---+
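
As for the custom UDF the question asks about: a plain Scala function that sums a Seq can be registered under the sum_array name used in the question's SQL. A minimal sketch, assuming the dataset is registered as a temporary view named orders; sum_array is not a built-in Spark function, and whether the deeper orderLines.totalPrice.net extraction resolves this way may depend on your Spark version:

// Register a UDF that sums an array of longs under the name the
// question's SQL uses; sum_array is not a built-in Spark function.
spark.udf.register("sum_array", (xs: Seq[Long]) => xs.sum)

orders.createOrReplaceTempView("orders")

spark.sql("""
  SELECT
    orderId,
    sum_array(orderLines.quantity)       AS totalQuantityItems,
    sum_array(orderLines.totalPrice.net) AS totalOrderNet
  FROM orders
""").show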