In Scala, it is easy to avoid duplicate columns after a join:
df1.join(df2, Seq("id"), "left").show()
But is there a similar solution in PySpark? If I do df1.join(df2, df1["id"] == df2["id"], "left").show()
in PySpark, I get two id columns.
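For context, a minimal sketch of the situation (the sample DataFrames df1 and df2 and their columns are assumptions made purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data, only to reproduce the issue
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v1"])
df2 = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "v2"])

# Joining on an expression keeps both id columns in the result
joined = df1.join(df2, df1["id"] == df2["id"], "left")
print(joined.columns)  # ['id', 'v1', 'id', 'v2'] -- 'id' appears twice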
...
Answer 0 (score: 0)
You have 3 options:
1. Join on the column name (shown here with an outer join); passing the key as a string keeps a single id column. A left-join variant is sketched after this list.
aDF.join(bDF, "id", "outer").show()
2. Use aliasing and drop the duplicate column. Note that because b.id is dropped, you lose the id values that exist only in bDF.
from pyspark.sql.functions import col

aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()
3. Use drop to remove the unwanted columns after the join ('ida' and 'idb' here are placeholders for the duplicated column names):
columns_to_drop = ['ida', 'idb']
df = df.drop(*columns_to_drop)
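To tie this back to the original left-join question: in PySpark the join key can also be passed as a column name, or a list of names (the analogue of Scala's Seq("id")), which keeps a single id column. A minimal sketch, reusing the hypothetical df1/df2 from above:

# Passing the key by name de-duplicates the join column, also for left joins
df1.join(df2, "id", "left").show()

# A list of key names mirrors Scala's Seq("id")
df1.join(df2, ["id"], "left").show()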
Let me know if this helps.