Spark row_number with an unordered partition to keep the natural order

Date: 2020-08-07 04:37:52

Tags: apache-spark apache-spark-sql

I want to partition without any ordering, so that the data keeps its natural order in the DataFrame. Please share any suggestions, thanks.

Consider the following data in a Spark DataFrame:

         raw data
----------------------------
 name | item id |   action
----------------------------
 John |    120  |   sell 
----------------------------
 John |    320  |   buy
----------------------------
 Jane |    120  |   sell 
----------------------------
 Jane |    450  |   buy
----------------------------
 Sam  |    360  |   sell 
----------------------------
 Sam  |    300  |   hold
----------------------------
 Sam  |    450  |   buy
----------------------------
 Tim  |    470  |   buy
----------------------------

There are a couple of rules in this table's schema:

1. Everyone has at least one `buy` action
2. Everyone's last action must be `buy` as well

Now I want to add a sequence column, simply to show the order of each person's actions:

            expectation
--------------------------------------
 name | item id |   action  |  seq   
--------------------------------------
 John |    120  |   sell    |  1
--------------------------------------
 John |    320  |   buy     |  2
--------------------------------------
 Jane |    120  |   sell    |  1
--------------------------------------
 Jane |    450  |   buy     |  2
--------------------------------------
 Sam  |    360  |   sell    |  1
--------------------------------------
 Sam  |    300  |   hold    |  2
--------------------------------------
 Sam  |    450  |   buy     |  3
--------------------------------------
 Tim  |    470  |   buy     |  1
--------------------------------------

Here is my code:

import org.apache.spark.sql.functions.{lit, row_number}
import org.apache.spark.sql.expressions.Window
....

val df = spark.read.json(....)
val spec = Window.partitionBy($"name").orderBy(lit(1))        // <-- don't know what to use for orderBy

val dfWithSeq = df.withColumn("seq", row_number().over(spec)) // <-- please show me the magic

Interestingly, the result returned in dfWithSeq shows that each person's actions are numbered in a random sequence, so with seq the actions no longer follow the order given in the original data table. I haven't been able to find a solution.

           actual result
--------------------------------------
 name | item id |   action  |  seq   
--------------------------------------
 John |    120  |   sell    |  1
--------------------------------------
 John |    320  |   buy     |  2
--------------------------------------
 Jane |    120  |   sell    |  2          <-- this is wrong
--------------------------------------
 Jane |    450  |   buy     |  1          <-- this is wrong
--------------------------------------
 Sam  |    360  |   sell    |  1
--------------------------------------
 Sam  |    300  |   hold    |  2
--------------------------------------
 Sam  |    450  |   buy     |  3
--------------------------------------
 Tim  |    470  |   buy     |  1
--------------------------------------

2 Answers:

Answer 0 (score: 1):

You need to use:

  • zipWithIndex, after converting to an RDD and back to a DataFrame. It is a narrow transformation, so it preserves your (initial) data order.
  • Then partition by name, taking the sequence number into account appropriately.

The rest is left to you.
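A minimal sketch of this approach, assuming a DataFrame df with the columns shown above (the helper column name row_idx is illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach an index that reflects the original row order;
// zipWithIndex does not shuffle, so the input order is preserved
val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val indexedSchema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
val indexedDf = spark.createDataFrame(indexedRdd, indexedSchema)

// The window can now order by the captured index instead of a dummy literal
val spec = Window.partitionBy($"name").orderBy($"row_idx")
val dfWithSeq = indexedDf
  .withColumn("seq", row_number().over(spec))
  .drop("row_idx")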

Answer 1 (score: 1):

Use monotonically_increasing_id:

import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}
import org.apache.spark.sql.expressions.Window
....

val df = spark.read.json(....)

// Order each partition by an id that captures the order in which rows were read
val spec = Window.partitionBy($"name").orderBy($"order")

val dfWithSeq = df
  .withColumn("order", monotonically_increasing_id())  // tag rows with an increasing id
  .withColumn("seq", row_number().over(spec))          // number actions per name
  .drop("order")                                       // helper column no longer needed
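Note that monotonically_increasing_id() produces ids that are increasing and unique but not consecutive: within each input partition the ids follow the row order, and the partition index sits in the high bits, so this reflects the original order as long as the source is read in order. If stricter guarantees are needed, the zipWithIndex approach from the first answer is an alternative.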