Spark: fill column values in the first table from values that exist in the second table but in different columns

Date: 2021-07-09 16:38:14

Tags: apache-spark pyspark apache-spark-sql

I am new to PySpark. I have two tables, and I am trying to fill in the values of certain columns with values that exist in the other table, but under different columns.

Table1: original_val is the column whose values are missing for the person rows, see below

    #+-------------+----------+----------------+
    #| name        | Value    | original_val   |
    #+-------------+----------+----------------+
    #|Movie_name   |  RHDM    |    123         |
    #|teacher_name |  Rohit   |    345         |
    #|  person     |  kerry   |                |
    #|  person     |  Suzen   |                |
    #|  person     |  JD_Jem  |                |
    #+-------------+----------+----------------+

Table2 has the structure below and contains the person names and some values. Note that the names and values are spread across different columns, as shown:

#+-----------+----------+-----------+---------+----------+--------+
#| key       | value_1  | key2      | value_2 | key3     | value_3|
#+-----------+----------+-----------+---------+----------+--------+
#|  kerry    |  540     |           |         |  JD_Jem  |  888   |
#|           |          |  Suzen    |  123    |          |        |
#+-----------+----------+-----------+---------+----------+--------+

Expected output: in Table1, the missing values for kerry, Suzen and JD_Jem need to be filled in from Table2, as shown below:

#+-------------+----------+----------------+
#| name        | Value    | original_val   |
#+-------------+----------+----------------+
#|Movie_name   |  RHDM    |    123         |
#|teacher_name |  Rohit   |    345         |
#|  person     |  kerry   |    540         |
#|  person     |  Suzen   |    123         |
#|  person     |  JD_Jem  |    888         |
#+-------------+----------+----------------+

I tried the following, but it did not give exactly the expected result:

select distinct t1.*,t2.value_1 as id from Table1 as t1 left join Table2 t2
on t1.Value=t2.key
union 
select distinct t1.*,t2.value_2 as id from Table1 as t1 left join Table2 t2
on t1.Value=t2.key2
union 
select distinct t1.*,t2.value_3 as id from Table1 as t1 left join Table2 t2
on t1.Value=t2.key3
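
One reason the three unioned left joins do not line up with the expected output is that every person row comes back once per branch, with NULLs from the branches that do not match, and the looked-up value lands in a new id column rather than in original_val. A single query that coalesces across three left joins may get closer. The sketch below is only illustrative: it assumes a SparkSession named spark, that Table1 and Table2 are available as tables or temp views, and it uses NULLIF to treat empty strings as missing.

# Sketch (not verified): one query, coalescing the lookups from three left joins
filled_df = spark.sql("""
    SELECT t1.name,
           t1.Value,
           COALESCE(NULLIF(t1.original_val, ''),
                    NULLIF(t2a.value_1, ''),
                    NULLIF(t2b.value_2, ''),
                    NULLIF(t2c.value_3, '')) AS original_val
    FROM Table1 t1
    LEFT JOIN Table2 t2a ON t1.Value = t2a.key
    LEFT JOIN Table2 t2b ON t1.Value = t2b.key2
    LEFT JOIN Table2 t2c ON t1.Value = t2c.key3
""")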

1 Answer:

Answer 0 (score: 1)

Here is a solution based on the DataFrame API. Assume the two tables are loaded as table1_df and table2_df.
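
For reproducibility, here is a minimal, purely illustrative way to build the two DataFrames. The column names follow the tables above; modelling the blank cells as empty strings is an assumption.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative test data mirroring Table1 and Table2 above
table1_df = spark.createDataFrame(
    [("Movie_name", "RHDM", "123"),
     ("teacher_name", "Rohit", "345"),
     ("person", "kerry", ""),
     ("person", "Suzen", ""),
     ("person", "JD_Jem", "")],
    ["name", "Value", "original_val"])

table2_df = spark.createDataFrame(
    [("kerry", "540", "", "", "JD_Jem", "888"),
     ("", "", "Suzen", "123", "", "")],
    ["key", "value_1", "key2", "value_2", "key3", "value_3"])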

# Bring table2 into key/value format by stacking the three key/value column pairs
key_value_df = (table2_df.select("key", "value_1")
                .union(table2_df.select("key2", "value_2"))
                .union(table2_df.select("key3", "value_3")))
# .filter(...)  # your logic to remove empty/null rows (see the sketch further below)

#+-----------+----------+
#| key       | value_1  |
#+-----------+----------+
#|  kerry    |  540     |
#|           |          |
#|           |          |
#|  Suzen    |  123     |
#|           |          |
#|  JD_Jem   |  888     |
#+-----------+----------+
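
The commented-out filter could, for example, drop the rows whose key is NULL or blank before joining. One possible version, assuming blank cells are empty or whitespace-only strings:

# Keep only rows that actually carry a person name
key_value_df = key_value_df.filter(
    F.col("key").isNotNull() & (F.trim(F.col("key")) != ""))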
     
         
# Now join 
joined_df = table1_df.join(key_value_df,table1_df.Value ==  key_value_df.key,"left")

 #+-------------+----------+----------------+-----------+----------+
 #| name        | Value    | original_val   | key       | value_1  |
 #+-------------+----------+----------------+-----------+----------+

#now fill in the values from value_1 into original_val for all empty original_val

final_df = (joined_df
            .withColumn("original_val",
                        F.when(F.col("original_val") == "", F.col("value_1"))
                         .otherwise(F.col("original_val")))
            .drop("key", "value_1"))

You can replace F.col("original_val") == "" with whatever condition matches how the missing values are represented in your data.
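
For example, if the missing cells can be either NULL or empty/whitespace-only strings, the condition could look like this (a sketch, otherwise identical to the code above):

is_missing = F.col("original_val").isNull() | (F.trim(F.col("original_val")) == "")

final_df = (joined_df
            .withColumn("original_val",
                        F.when(is_missing, F.col("value_1"))
                         .otherwise(F.col("original_val")))
            .drop("key", "value_1"))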
