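I am new to PySpark. I have two tables, and I am trying to fill in values for certain rows; those values exist in the other table, but spread across different columns.
Table1: original_val is the column that is missing values for the person rows, see below.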
#+--------------+--------+--------------+
#| name         | Value  | original_val |
#+--------------+--------+--------------+
#| Movie_name   | RHDM   | 123          |
#| teacher_name | Rohit  | 345          |
#| person       | kerry  |              |
#| person       | Suzen  |              |
#| person       | JD_Jem |              |
#|              |        |              |
#+--------------+--------+--------------+
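Table2: the structure below contains the person names and some values. Note that the names and values are spread across different columns, as shown below.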
#+-------+---------+-------+---------+--------+---------+
#| key   | value_1 | key2  | value_2 | key3   | value_3 |
#+-------+---------+-------+---------+--------+---------+
#| kerry | 540     |       |         | JD_Jem | 888     |
#|       |         | Suzen | 123     |        |         |
#|       |         |       |         |        |         |
#+-------+---------+-------+---------+--------+---------+
Expected output: in Table1, the missing values for kerry, Suzen, and JD_Jem need to be filled in from Table2, as shown below.
#+--------------+--------+--------------+
#| name         | Value  | original_val |
#+--------------+--------+--------------+
#| Movie_name   | RHDM   | 123          |
#| teacher_name | Rohit  | 345          |
#| person       | kerry  | 540          |
#| person       | Suzen  | 123          |
#| person       | JD_Jem | 888          |
#|              |        |              |
#+--------------+--------+--------------+
I tried the following but did not get exactly the expected result:
select distinct t1.*,t2.value_1 as id from Table1 as t1 left join Table2 t2
on t1.Value=t2.key
union
select distinct t1.*,t2.value_2 as id from Table1 as t1 left join Table2 t2
on t1.Value=t2.key2
union
select distinct t1.*,t2.value_3 as id from Table1 as t1 left join Table2 t2
on t1.Value=t2.key3
Answer 0 (score: 1)
I am writing a solution based on the DataFrame API. Assume table1_df and table2_df hold Table1 and Table2.
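# Bring table2 into a proper two-column (key, value) format.
# union() matches columns by position, so the result keeps the names "key" and "value_1".
key_value_df = (
    table2_df.select("key", "value_1")
    .union(table2_df.select("key2", "value_2"))
    .union(table2_df.select("key3", "value_3"))
    # .filter(<your logic to remove empty/null fields>)
)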
#+--------+---------+
#| key    | value_1 |
#+--------+---------+
#| kerry  | 540     |
#|        |         |
#|        |         |
#| Suzen  | 123     |
#|        |         |
#| JD_Jem | 888     |
#+--------+---------+
# Now join
joined_df = table1_df.join(key_value_df,table1_df.Value == key_value_df.key,"left")
#+--------------+--------+--------------+--------+---------+
#| name         | Value  | original_val | key    | value_1 |
#+--------------+--------+--------------+--------+---------+
# Now fill value_1 into original_val wherever original_val is empty
from pyspark.sql import functions as F  # needed for F.when / F.col
final_df = (
    joined_df.withColumn(
        "original_val",
        F.when(F.col("original_val") == "", F.col("value_1")).otherwise(F.col("original_val")),
    )
    .drop("key", "value_1")
)
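You can replace F.col("original_val") == "" with whatever condition fits your data, for example F.col("original_val").isNull() if the missing values are nulls rather than empty strings.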
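To make the reshape-then-join idea easier to check without a Spark session, here is a minimal plain-Python sketch of the same logic, using the sample rows from the question. It is not PySpark code: the helper names unpivot_pairs and fill_missing are hypothetical, and empty strings are assumed to stand in for the blank cells.

```python
# Plain-Python model of the answer's logic (not PySpark); all names are illustrative.
table1 = [
    {"name": "Movie_name", "Value": "RHDM", "original_val": "123"},
    {"name": "teacher_name", "Value": "Rohit", "original_val": "345"},
    {"name": "person", "Value": "kerry", "original_val": ""},
    {"name": "person", "Value": "Suzen", "original_val": ""},
    {"name": "person", "Value": "JD_Jem", "original_val": ""},
]
table2 = [
    {"key": "kerry", "value_1": "540", "key2": "", "value_2": "",
     "key3": "JD_Jem", "value_3": "888"},
    {"key": "", "value_1": "", "key2": "Suzen", "value_2": "123",
     "key3": "", "value_3": ""},
]


def unpivot_pairs(rows, column_pairs):
    """Stack each (key column, value column) pair into one lookup, skipping empty keys."""
    lookup = {}
    for key_col, value_col in column_pairs:
        for row in rows:
            if row[key_col]:  # assumption: blank cells are empty strings
                lookup[row[key_col]] = row[value_col]
    return lookup


def fill_missing(rows, lookup):
    """Left-join style fill: keep every row, fill original_val only where it is empty."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay unchanged
        if new_row["original_val"] == "":
            new_row["original_val"] = lookup.get(new_row["Value"], "")
        filled.append(new_row)
    return filled


pairs = [("key", "value_1"), ("key2", "value_2"), ("key3", "value_3")]
lookup = unpivot_pairs(table2, pairs)
result = fill_missing(table1, lookup)
for row in result:
    print(row["name"], row["Value"], row["original_val"])
```

Skipping empty keys in unpivot_pairs plays the same role as the filter step hinted at in the answer, and rows that already have an original_val are left untouched, which mirrors the otherwise branch.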