How do I remove duplicates from a Spark dataframe while keeping the latest records?

Asked: 2019-04-12 22:18:02

Tags: pyspark apache-spark-sql

I am using Spark to load JSON files from Amazon S3. I want to remove duplicates based on two columns of the dataframe, keeping the most recent record (I have a timestamp column). What is the best way to do this? Note that the duplicates may be spread across partitions. Can I drop duplicates, keeping the last record, without shuffling? I am dealing with 1 TB of data.

I was thinking of repartitioning the dataframe by those two columns, so that all duplicate records would be hashed consistently into the same partition; a sort within each partition followed by a drop-duplicates would then eliminate all duplicates and keep only one record per key. I don't know whether this is possible. Any information is appreciated.
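Roughly, this is a sketch of what I have in mind (hypothetical column names: key1, key2 for the dedup keys, ts for the timestamp, and a placeholder S3 path; I am not sure dropDuplicates is guaranteed to keep the first row of each sorted partition):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.json('s3a://my-bucket/path/')  # placeholder path

# hash-partition by the dedup keys so all duplicates land in the same partition,
# sort each partition by the timestamp descending, then drop duplicates per key
deduped = (df.repartition('key1', 'key2')
             .sortWithinPartitions(F.col('ts').desc())
             .dropDuplicates(['key1', 'key2']))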

1 Answer:

Answer 0 (score: 3):

It is probably easier to do this with the row_number() window function, where c1 below is the timestamp column and c2, c3 are the columns used to partition your data:

from pyspark.sql import Window, functions as F

# create a win spec which is partitioned by c2, c3 and ordered by c1 in descending order
win = Window.partitionBy('c2', 'c3').orderBy(F.col('c1').desc())

# set rn with F.row_number() and filter the result by rn == 1
df_new = df.withColumn('rn', F.row_number().over(win)).where('rn = 1').drop('rn')
df_new.show()
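As a quick illustration, here is a minimal self-contained sketch with hypothetical toy data (the values are made up; the column names match the code above):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical toy data: the ('a', 'x') key appears twice, ('b', 'y') once
df = spark.createDataFrame(
    [(1, 'a', 'x'), (3, 'a', 'x'), (2, 'b', 'y')],
    ['c1', 'c2', 'c3'])

win = Window.partitionBy('c2', 'c3').orderBy(F.col('c1').desc())
df.withColumn('rn', F.row_number().over(win)).where('rn = 1').drop('rn').show()
# expected rows (order may vary): (3, 'a', 'x') and (2, 'b', 'y'),
# i.e. only the latest row of each (c2, c3) group is kept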

Edit:

If you want only the rows that actually have duplicates and want to drop the unique ones, add another field:

from pyspark.sql import Window, functions as F

# create a win spec which is partitioned by c2, c3 and ordered by c1 in descending order
win = Window.partitionBy('c2', 'c3').orderBy(F.col('c1').desc())

# window to cover all rows in the same partition
win2 = Window.partitionBy('c2', 'c3') \
             .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# set new columns: rn, cnt and filter the result by rn == 1 and cnt > 1
df_new = df.withColumn('rn', F.row_number().over(win)) \
           .withColumn('cnt', F.count('c1').over(win2)) \
           .where('rn = 1 and cnt > 1') \
           .drop('rn', 'cnt')
df_new.show()
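With the same hypothetical toy data as in the sketch above, this variant would return only the ('a', 'x') group, since ('b', 'y') has no duplicates:

# reusing df, win and win2 from above; cnt counts all rows in each (c2, c3) group
df.withColumn('rn', F.row_number().over(win)) \
  .withColumn('cnt', F.count('c1').over(win2)) \
  .where('rn = 1 and cnt > 1') \
  .drop('rn', 'cnt') \
  .show()
# expected single row: (3, 'a', 'x'), the latest row of the only duplicated group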