Joining multiple dataframes horizontally in pyspark

Date: 2018-06-22 14:09:23

Tags: join indexing pyspark apache-spark-sql

I am trying to join multiple dataframes (all with the same number of records) horizontally in pyspark, using monotonically_increasing_id() as the join key. However, the result I get has more records than the inputs:

from pyspark.sql.functions import col, monotonically_increasing_id

df = {}  # holds the intermediate DataFrames, keyed by loop index
for i in range(len(lst)+1):
    if i == 0:
        df[i] = cust_mod.select('key')
        df[i+1] = df[i].withColumn("idx", monotonically_increasing_id())

    else:
        # rename the value column after the i-th object, then attach an id
        df_tmp = o[i-1].select(col("value").alias(obj_names[i-1]))
        df_tmp = df_tmp.withColumn("idx", monotonically_increasing_id())

        df[i+1] = df[i].join(df_tmp, "idx", "outer")

The expected record count in df[i+1] is ~60m; I get ~88m instead. It seems monotonically_increasing_id() does not generate the same numbers across the DataFrames. How can I fix this?
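To make the mismatch concrete, here is a small toy example (not my real data; it assumes a running SparkSession) showing that the generated ids depend on partitioning, which seems to be why the outer join inflates the row count:

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# monotonically_increasing_id() puts the partition id in the upper bits,
# so differently partitioned DataFrames assign different ids to
# corresponding rows.
a = spark.range(10).repartition(2).select(monotonically_increasing_id().alias("idx"))
b = spark.range(10).repartition(5).select(monotonically_increasing_id().alias("idx"))

# The two id sets do not line up, so the outer join typically
# returns more than 10 rows.
print(a.join(b, "idx", "outer").count())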

Additional details:

cust_mod - a DataFrame with ~60m records
o[i] - another set of DataFrames, each with the same record count as cust_mod
lst - a list that has 49 components, so 49 loops in total

I also tried using zipWithIndex():

for i in range(len(lst)+1):
    if i == 0:
        df[i] = cust_mod.select('key')
        # zipWithIndex pairs each row with its position: (Row, index)
        df[i+1] = df[i].rdd.zipWithIndex().toDF()

    else:
        df_tmp = o[i-1].select("value").rdd.zipWithIndex().toDF()
        df_tmp1 = df_tmp.select(col("_1").alias(obj_names[i-1]), col("_2"))

        # "_2" is the zipWithIndex position, used as the join key
        df[i+1] = df[i].join(df_tmp1, "_2", "inner").drop(df_tmp1._2)

But that is way too slow, roughly 50x slower.
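One more idea I am considering is to replace the raw ids with a contiguous row number (just a sketch; with_row_index is my own helper name and I have not checked how it performs at ~60m rows). The window has no partitionBy, so Spark moves everything into a single partition to rank it, which should be correct but may also be slow:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id, col

def with_row_index(sdf, idx_col="idx"):
    # Number rows 0..N-1 in their current order; monotonically_increasing_id()
    # is only used as a stable ordering key for the window.
    w = Window.orderBy(monotonically_increasing_id())
    return sdf.withColumn(idx_col, row_number().over(w) - 1)

# Hypothetical usage with the variables above (cust_mod, o, obj_names):
# result = with_row_index(cust_mod.select("key"))
# for name, other in zip(obj_names, o):
#     indexed = with_row_index(other.select(col("value").alias(name)))
#     result = result.join(indexed, "idx", "inner")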

0 Answers:

No answers yet.