Spark row_number with an unordered partition to keep the natural order

Date: 2020-08-07 04:37:52

Tags: apache-spark apache-spark-sql

I want to partition without any ordering, so that the data keeps its natural order in the DataFrame. Please share any suggestions, thanks.

Consider the following data in a Spark DataFrame:

         raw data
----------------------------
 name | item id |   action
----------------------------
 John |    120  |   sell 
----------------------------
 John |    320  |   buy
----------------------------
 Jane |    120  |   sell 
----------------------------
 Jane |    450  |   buy
----------------------------
 Sam  |    360  |   sell 
----------------------------
 Sam  |    300  |   hold
----------------------------
 Sam  |    450  |   buy
----------------------------
 Tim  |    470  |   buy
----------------------------

There are a couple of rules in this table's schema:

1. Everyone has at least one `buy` action
2. Everyone's last action must be `buy` as well

Now I want to add a sequence column, simply to show the order of each person's actions:

            expectation
--------------------------------------
 name | item id |   action  |  seq   
--------------------------------------
 John |    120  |   sell    |  1
--------------------------------------
 John |    320  |   buy     |  2
--------------------------------------
 Jane |    120  |   sell    |  1
--------------------------------------
 Jane |    450  |   buy     |  2
--------------------------------------
 Sam  |    360  |   sell    |  1
--------------------------------------
 Sam  |    300  |   hold    |  2
--------------------------------------
 Sam  |    450  |   buy     |  3
--------------------------------------
 Tim  |    470  |   buy     |  1
--------------------------------------

Here is my code:

import org.apache.spark.sql.functions.{lit, row_number}
import org.apache.spark.sql.expressions.Window
....

val df = spark.read.json(....)
val spec = Window.partitionBy($"name").orderBy(lit(1))        // <-- don't know what to use for orderBy

val dfWithSeq = df.withColumn("seq", row_number().over(spec)) // <-- please show me the magic

Interestingly, the result returned in dfWithSeq shows that each person's actions are numbered in a random sequence, so with seq the actions no longer follow the order given in the original data table. I haven't been able to find a solution.

           actual result
--------------------------------------
 name | item id |   action  |  seq   
--------------------------------------
 John |    120  |   sell    |  1
--------------------------------------
 John |    320  |   buy     |  2
--------------------------------------
 Jane |    120  |   sell    |  2          <-- this is wrong
--------------------------------------
 Jane |    450  |   buy     |  1          <-- this is wrong
--------------------------------------
 Sam  |    360  |   sell    |  1
--------------------------------------
 Sam  |    300  |   hold    |  2
--------------------------------------
 Sam  |    450  |   buy     |  3
--------------------------------------
 Tim  |    470  |   buy     |  1
--------------------------------------

2 Answers:

Answer 0 (score: 1):

You need to use:

  • zipWithIndex, after converting to an RDD and back to a DataFrame. It is a narrow transformation, so it preserves your (initial) data order.
  • Then partition by name, taking the sequence number into account appropriately.

The rest is left to you.
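A minimal sketch of this approach, assuming a DataFrame df with the columns shown above (the helper column name row_idx is illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach an index that reflects the original row order;
// zipWithIndex does not shuffle, so the input order is preserved
val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val indexedSchema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
val indexedDf = spark.createDataFrame(indexedRdd, indexedSchema)

// The window can now order by the captured index instead of a dummy literal
val spec = Window.partitionBy($"name").orderBy($"row_idx")
val dfWithSeq = indexedDf
  .withColumn("seq", row_number().over(spec))
  .drop("row_idx")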

Answer 1 (score: 1):

Use monotonically_increasing_id:

import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}
import org.apache.spark.sql.expressions.Window
....

val df = spark.read.json(....)

// Order each partition by an id that captures the order in which rows were read
val spec = Window.partitionBy($"name").orderBy($"order")

val dfWithSeq = df
  .withColumn("order", monotonically_increasing_id())  // tag rows with an increasing id
  .withColumn("seq", row_number().over(spec))          // number actions per name
  .drop("order")                                       // helper column no longer needed
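Note that monotonically_increasing_id() produces ids that are increasing and unique but not consecutive: within each input partition the ids follow the row order, and the partition index sits in the high bits, so this reflects the original order as long as the source is read in order. If stricter guarantees are needed, the zipWithIndex approach from the first answer is an alternative.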