Question

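I have a dataframe which has one row and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as is.

Sample DF: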

from pyspark import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode

sqlc = SQLContext(sc)

df = sqlc.createDataFrame([Row(a=1, b=[1,2,3],c=[7,8,9], d='foo')])
# +---+---------+---------+---+
# |  a|        b|        c|  d|
# +---+---------+---------+---+
# |  1|[1, 2, 3]|[7, 8, 9]|foo|
# +---+---------+---------+---+

What I want:

+---+---+----+------+
|  a|  b|  c |    d |
+---+---+----+------+
|  1|  1|  7 |  foo |
|  1|  2|  8 |  foo |
|  1|  3|  9 |  foo |
+---+---+----+------+

If I only had one list column, this would be easy by just doing an explode:

df_exploded = df.withColumn('b', explode('b'))
# >>> df_exploded.show()
# +---+---+---------+---+
# |  a|  b|        c|  d|
# +---+---+---------+---+
# |  1|  1|[7, 8, 9]|foo|
# |  1|  2|[7, 8, 9]|foo|
# |  1|  3|[7, 8, 9]|foo|
# +---+---+---------+---+

However, if I also try to explode the c column, I end up with a dataframe whose length is the square of what I want:

df_exploded_again = df_exploded.withColumn('c', explode('c'))
# >>> df_exploded_again.show()
# +---+---+---+---+
# |  a|  b|  c|  d|
# +---+---+---+---+
# |  1|  1|  7|foo|
# |  1|  1|  8|foo|
# |  1|  1|  9|foo|
# |  1|  2|  7|foo|
# |  1|  2|  8|foo|
# |  1|  2|  9|foo|
# |  1|  3|  7|foo|
# |  1|  3|  8|foo|
# |  1|  3|  9|foo|
# +---+---+---+---+

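What I want is: for each column, take the nth element of the array in that column and add it to a new row. I've tried mapping an explode across all columns in the dataframe, but that doesn't seem to work either: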

df_split = df.rdd.map(lambda col: df.withColumn(col, explode(col))).toDF()
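To make the difference concrete, here is a minimal plain-Python sketch (no Spark involved) that models the single sample row as a dict. It only illustrates why two successive explodes give nine rows while the desired result has three; the variable names are illustrative and not part of any Spark API.

```python
from itertools import product

# One row of the sample data, modelled as a plain dict (assumption: no Spark needed here).
row = {"a": 1, "b": [1, 2, 3], "c": [7, 8, 9], "d": "foo"}

# Exploding b and then c pairs every b with every c: a Cartesian product.
exploded_twice = [
    (row["a"], b, c, row["d"]) for b, c in product(row["b"], row["c"])
]

# Pairing the n-th element of b with the n-th element of c is a zip.
zipped = [
    (row["a"], b, c, row["d"]) for b, c in zip(row["b"], row["c"])
]

print(len(exploded_twice))  # 9
print(zipped)  # [(1, 1, 7, 'foo'), (1, 2, 8, 'foo'), (1, 3, 9, 'foo')]
```

So the operation being asked for is an element-wise zip of the list columns followed by a single explode, rather than one explode per column.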

Answer 1

Spark >= 2.4

You can replace the zip_ udf with the arrays_zip function:

from pyspark.sql.functions import arrays_zip, col

(df
    .withColumn("tmp", arrays_zip("b", "c"))
    .withColumn("tmp", explode("tmp"))
    .select("a", col("tmp.b"), col("tmp.c"), "d"))

Spark < 2.4

Using DataFrames and UDF:

from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType
from pyspark.sql.functions import col, udf, explode

zip_ = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        # Adjust types to reflect data types
        StructField("first", IntegerType()),
        StructField("second", IntegerType())
    ]))
)

(df
    .withColumn("tmp", zip_("b", "c"))
    # UDF output cannot be directly passed to explode
    .withColumn("tmp", explode("tmp"))
    .select("a", col("tmp.first").alias("b"), col("tmp.second").alias("c"), "d"))

Using RDDs:

(df
    .rdd
    .flatMap(lambda row: [(row.a, b, c, row.d) for b, c in zip(row.b, row.c)])
    .toDF(["a", "b", "c", "d"]))

Both solutions are inefficient due to Python communication overhead. If the data size is fixed, you can do something like this:

from functools import reduce
from pyspark.sql import DataFrame

# Length of array
n = 3

# For legacy Python you'll need a separate function
# in place of method accessor
reduce(
    DataFrame.unionAll,
    (df.select("a", col("b").getItem(i), col("c").getItem(i), "d")
        for i in range(n))
).toDF("a", "b", "c", "d")

or even:

from pyspark.sql.functions import array, struct

# SQL level zip of arrays of known size
# followed by explode
tmp = explode(array(*[
    struct(col("b").getItem(i).alias("b"), col("c").getItem(i).alias("c"))
    for i in range(n)
]))

(df
    .withColumn("tmp", tmp)
    .select("a", col("tmp").getItem("b"), col("tmp").getItem("c"), "d"))

This should be significantly faster compared to UDF or RDD. Generalized to support an arbitrary number of columns:

# This uses keyword only arguments
# If you use legacy Python you'll have to change signature
# Body of the function can stay the same
def zip_and_explode(*colnames, n):
    return explode(array(*[
        struct(*[col(c).getItem(i).alias(c) for c in colnames])
        for i in range(n)
    ]))

df.withColumn("tmp", zip_and_explode("b", "c", n=3))

Answer 2

One-liner (for Spark >= 2.4.0):

(df.withColumn("bc", arrays_zip("b", "c"))
   .select("a", explode("bc").alias("tbc"))
   .select("a", col("tbc.b"), col("tbc.c"))
   .show())

Import required:

from pyspark.sql.functions import arrays_zip, col, explode

Steps -

Create a column bc which is an arrays_zip of columns b and c
Explode bc to get a struct tbc
Select the required columns a, b and c (all exploded as required).
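The three steps above can be sketched in plain Python for a single row, which may help when reasoning about the intermediate bc and tbc values. This is only a model of the data flow, assuming equal-length list columns as stated in the question; zip_explode_select is a hypothetical helper name, not a Spark function.

```python
def zip_explode_select(row, list_cols, keep_cols):
    """Plain-Python model of arrays_zip -> explode -> select for one row."""
    # Step 1: zip the list columns element-wise into a list of "structs" (dicts).
    bc = [
        dict(zip(list_cols, values))
        for values in zip(*(row[c] for c in list_cols))
    ]
    # Step 2: "explode" bc, giving one row per struct alongside the kept columns.
    exploded = [{**{k: row[k] for k in keep_cols}, "tbc": tbc} for tbc in bc]
    # Step 3: select the struct fields as top-level columns.
    return [{**{k: r[k] for k in keep_cols}, **r["tbc"]} for r in exploded]


row = {"a": 1, "b": [1, 2, 3], "c": [7, 8, 9], "d": "foo"}
rows = zip_explode_select(row, list_cols=["b", "c"], keep_cols=["a"])
for r in rows:
    print(r)
```

As in the one-liner, only the columns listed in keep_cols survive, so d would have to be added there to carry it through.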

Pyspark: Split multiple array columns into rows
