Question

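I have a dataset that currently has 233,465 rows and grows by approximately 10,000 rows per day. I need to randomly select rows from the full dataset for use in ML training. I added an "id" column to serve as the "index".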

from pyspark.sql.functions import monotonically_increasing_id
spark_df = n_data.withColumn("id", monotonically_increasing_id())

I run the following code, expecting it to return 5 rows whose id matches the "indices" list, with a count of 5.

from pyspark.sql.functions import col

indices = [1000, 999, 45, 1001, 1823, 123476]
result = spark_df.filter(col("id").isin(indices))
result.show()
print(result.count())

Instead, I get 3 rows, with the IDs 45, 1000, and 1001.

Any ideas on what might be wrong here? This seems pretty cut and dried.

Thanks!

Answer 1

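There is no built-in function that assigns a unique, sequential ID to each row. monotonically_increasing_id() guarantees IDs that are unique and increasing, but not consecutive, so values can be skipped. However, you can work around this with a window function: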

df = spark.createDataFrame([(3,),(7,),(9,),(1,),(-3,),(5,)], ["values"])
df.show()

+------+
|values|
+------+
|     3|
|     7|
|     9|
|     1|
|    -3|
|     5|
+------+



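from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order by a temporary monotonically increasing column, then number the rows 1..N
df = (df.withColumn('dummy', F.monotonically_increasing_id())
       .withColumn('ID', F.row_number().over(Window.orderBy('dummy')))
       .drop('dummy'))
df.show()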

+------+---+
|values| ID|
+------+---+
|     3|  1|
|     7|  2|
|     9|  3|
|     1|  4|
|    -3|  5|
|     5|  6|
+------+---+
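
For some context on why filtering on the original "id" column can miss rows: the Spark docs describe the current implementation of monotonically_increasing_id() as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. Here is a rough pure-Python sketch of that layout (no Spark needed). The function name simulated_ids and the partition sizes are made up for illustration; the real sizes depend on how Spark splits your data, so this will not necessarily reproduce the exact 3 rows from the question.

```python
# Pure-Python model of the documented ID layout:
# partition index in the upper 31 bits, per-partition row number in the lower 33 bits.

def simulated_ids(rows_per_partition):
    """Return the IDs rows would receive, given the row count of each partition."""
    ids = []
    for partition_index, row_count in enumerate(rows_per_partition):
        base = partition_index << 33  # each partition starts at a multiple of 2**33
        ids.extend(base + offset for offset in range(row_count))
    return ids

# Hypothetical layout: 3 partitions holding 1000, 1000, and 500 rows.
ids = simulated_ids([1000, 1000, 500])
id_set = set(ids)

wanted = [45, 999, 1000, 1823]
found = [i for i in wanted if i in id_set]

print(found)      # only the IDs that fall inside the first partition's range
print(ids[1000])  # first ID of the second partition, i.e. 2**33
```

With this assumed layout, 45 and 999 are found but 1000 and 1823 are not, even though there are 2,500 rows in total, because the second partition starts counting at 8589934592 rather than 1000. That is why the row_number() approach above is the safer choice when you need to look rows up by a list of consecutive indices.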
