I have a table named mytable available as a DataFrame. Here is the table:
+---+----+----+----+
|  x|   y|   z|   w|
+---+----+----+----+
|  1|   a|null|null|
|  1|null|   b|null|
|  1|null|null|   c|
|  2|   d|null|null|
|  2|null|   e|null|
|  2|null|null|   f|
+---+----+----+----+
I want to group by column x and concatenate the values of columns y, z and w within each group. The result should look like this:
+---+------+
|  x|result|
+---+------+
|  1| a b c|
|  2| d e f|
+---+------+
Answer 0 (score: 1)
Hope this helps!
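from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, collect_list, concat, concat_ws, lit

# Reuse the active session if there is one (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# Sample data
df = spark.createDataFrame(
    [
        (1, 'a', None, None),
        (1, None, 'b', None),
        (1, None, None, 'c'),
        (2, 'd', None, None),
        (2, None, 'e', None),
        (2, None, None, 'f'),
    ],
    schema='x int, y string, z string, w string',
)
df.show()

# Per row: replace nulls with '' and concatenate y, z, w into one string.
# Per group of x: collect those strings and join them with a space.
value_cols = [coalesce(c, lit('')) for c in df.columns[1:]]
result_df = df.groupBy('x').agg(
    concat_ws(' ', collect_list(concat(*value_cols))).alias('result')
)
result_df.show()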
The output is:
+---+------+
|  x|result|
+---+------+
|  1| a b c|
|  2| d e f|
+---+------+
Sample input:
+---+----+----+----+
|  x|   y|   z|   w|
+---+----+----+----+
|  1|   a|null|null|
|  1|null|   b|null|
|  1|null|null|   c|
|  2|   d|null|null|
|  2|null|   e|null|
|  2|null|null|   f|
+---+----+----+----+
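
If it helps to see what the aggregation is meant to compute, here is a rough plain-Python sketch of the same group-and-concatenate logic, without Spark. The names group_concat and rows are made up for illustration and are not part of any library. It assumes each row is a tuple whose first field is the grouping key and that None stands in for null.

```python
from collections import defaultdict


def group_concat(rows, sep=" "):
    """Group rows by their first field and join the non-null remaining fields.

    Hypothetical helper for illustration only; not a Spark or pandas API.
    """
    grouped = defaultdict(list)
    for key, *values in rows:
        # Drop None (null) cells and keep the rest in the order they appear.
        grouped[key].extend(v for v in values if v is not None)
    return {key: sep.join(vals) for key, vals in grouped.items()}


rows = [
    (1, "a", None, None),
    (1, None, "b", None),
    (1, None, None, "c"),
    (2, "d", None, None),
    (2, None, "e", None),
    (2, None, None, "f"),
]

result = group_concat(rows)
print(result)  # {1: 'a b c', 2: 'd e f'}
```

One difference to keep in mind: this sketch processes the rows in list order, whereas in Spark the order of the values returned by collect_list is not guaranteed after a shuffle, so on a larger dataset the pieces inside result may come out in a different order.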