Add a new column to a DataFrame based on the values in two other columns (PySpark required)

Time: 2020-08-06 07:51:48

Tags: pyspark


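I want to add a column named "joint_pred_x" (x = 0, 1, 2) based on the two values in "nb_pred_x" and "svm_pred_x": 0 if nb = 1 and svm = 1; 1 if nb = 1 and svm = 0; 2 if nb = 0 and svm = 1; and 3 if nb = 0 and svm = 0. I think withColumn can do the job, but I am confused by the conditional logic. Thanks in advance; the solution only needs to be in PySpark.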

1 answer:

Answer 0: (score: 0)

You can use a CASE statement.

+---------+---------+---------+----------+----------+----------+
|nb_pred_0|nb_pred_1|nb_pred_2|svm_pred_0|svm_pred_1|svm_pred_2|
+---------+---------+---------+----------+----------+----------+
|0.0      |1.0      |0.0      |0.0       |1.0       |0.0       |
+---------+---------+---------+----------+----------+----------+


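from pyspark.sql.functions import expr

# Column-name prefixes of the two classifiers being combined.
p1, p2 = 'nb', 'svm'

for i in range(0, 3):

    index = str(i)

    # Encode each (nb, svm) pair of 0/1 predictions as one code from 0 to 3.
    # Rows matching none of the branches get NULL, since there is no ELSE.
    df = df.withColumn('joint_pred_' + index, expr(f'''
            CASE 
                WHEN {p1}_pred_{index} == 1 and {p2}_pred_{index} == 1 THEN 0
                WHEN {p1}_pred_{index} == 1 and {p2}_pred_{index} == 0 THEN 1
                WHEN {p1}_pred_{index} == 0 and {p2}_pred_{index} == 1 THEN 2
                WHEN {p1}_pred_{index} == 0 and {p2}_pred_{index} == 0 THEN 3
            END
        '''))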

df.show(10, False)

+---------+---------+---------+----------+----------+----------+------------+------------+------------+
|nb_pred_0|nb_pred_1|nb_pred_2|svm_pred_0|svm_pred_1|svm_pred_2|joint_pred_0|joint_pred_1|joint_pred_2|
+---------+---------+---------+----------+----------+----------+------------+------------+------------+
|0.0      |1.0      |0.0      |0.0       |1.0       |0.0       |3           |0           |3           |
+---------+---------+---------+----------+----------+----------+------------+------------+------------+
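
To sanity-check the mapping without a Spark session, here is a minimal plain-Python sketch. The helper names `joint_code` and `case_expr` are hypothetical and not part of PySpark; only the column naming scheme and the four-way mapping come from the question. It reproduces the 3, 0, 3 result for the sample row and builds the same CASE text that is passed to `expr`.

```python
# Hypothetical helpers for illustration; they are not part of PySpark.

def joint_code(nb, svm):
    """Return the joint code for one pair of 0/1 predictions."""
    table = {(1, 1): 0, (1, 0): 1, (0, 1): 2, (0, 0): 3}
    # Any other pair yields None, mirroring a CASE with no ELSE branch.
    return table.get((nb, svm))


def case_expr(index, p1="nb", p2="svm"):
    """Build the Spark SQL CASE expression text for column suffix `index`."""
    a = f"{p1}_pred_{index}"
    b = f"{p2}_pred_{index}"
    return (
        "CASE "
        f"WHEN {a} == 1 AND {b} == 1 THEN 0 "
        f"WHEN {a} == 1 AND {b} == 0 THEN 1 "
        f"WHEN {a} == 0 AND {b} == 1 THEN 2 "
        f"WHEN {a} == 0 AND {b} == 0 THEN 3 "
        "END"
    )


# The sample row from the answer, as a plain dict.
row = {
    "nb_pred_0": 0.0, "nb_pred_1": 1.0, "nb_pred_2": 0.0,
    "svm_pred_0": 0.0, "svm_pred_1": 1.0, "svm_pred_2": 0.0,
}
joint = [joint_code(row[f"nb_pred_{i}"], row[f"svm_pred_{i}"]) for i in range(3)]
print(joint)          # [3, 0, 3]
print(case_expr(1))   # the CASE text for joint_pred_1
```

Inside the loop, `expr(case_expr(index))` should behave the same as the inline f-string, assuming the columns only ever hold 0 or 1.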