Convert a list of DataFrames into a single DataFrame with specific columns in Scala

Date: 2017-05-05 15:53:15

Tags: scala dataframe spark-dataframe

I am trying to convert a list of DataFrames into a single DataFrame, as shown below, where dfList is a List[sql.DataFrame]:

dfList=List([ID: bigint, A: string], [ID: bigint, B: string], [ID: bigint, C: string], [ID: bigint, D: string])

dfList = List( +--------+-------------+  +--------+-------------+ +--------+--------+ +--------+--------+
               |    ID  |     A       |  |   ID   |     B       | |   ID   |     C  | |   ID   |   D    |
               +--------+-------------+  +--------+-------------+ +--------+--------+ +--------+--------+
               |    9574|            F|  |    9574|       005912| |    9574| 2016022| |    9574|      VD|
               |    9576|            F|  |    9576|       005912| |    9576| 2016022| |    9576|      VD|
               |    9578|            F|  |    9578|       005912| |    9578| 2016022| |    9578|      VD|
               |    9580|            F|  |    9580|       005912| |    9580| 2016022| |    9580|      VD|
               |    9582|            F|  |    9582|       005912| |    9582| 2016022| |    9582|      VD|
               +--------+-------------+, +--------+-------------+,+--------+--------+,+--------+--------+ )

Expected output

+--------+-------------+----------+--------+-------+
|   ID   |     A       |      B   |  C     |  D    |
+--------+-------------+----------+--------+-------+
|    9574|            F|    005912| 2016022|     00|
|    9576|            F|    005912| 2016022|     01|
|    9578|            F|    005912| 2016022|     20|
|    9580|            F|    005912| 2016022|     19|
|    9582|            F|    005912| 2016022|     89|
+--------+-------------+----------+--------+-------+

2 Answers:

Answer 0 (score: 3)

You'll need to use foldLeft together with join.

Generating the data

scala> val dfList = ('a' to 'd').map(col => (1 to 5).zip(col.toInt to col.toInt + 4).toDF("ID", col.toString)).toList
dfList: List[org.apache.spark.sql.DataFrame] = List([ID: int, a: int], [ID: int, b: int], [ID: int, c: int], [ID: int, d: int])

This gives me the following DataFrames:

+---+---+   +---+---+   +---+---+   +---+---+
| ID|  a|   | ID|  b|   | ID|  c|   | ID|  d|
+---+---+   +---+---+   +---+---+   +---+---+
|  1| 97|   |  1| 98|   |  1| 99|   |  1|100|
|  2| 98|   |  2| 99|   |  2|100|   |  2|101|
|  3| 99|   |  3|100|   |  3|101|   |  3|102|
|  4|100|   |  4|101|   |  4|102|   |  4|103|
|  5|101|   |  5|102|   |  5|103|   |  5|104|
+---+---+   +---+---+   +---+---+   +---+---+
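
For readers less used to Scala one-liners, the generator above can be unpacked roughly as follows (an equivalent sketch; the intermediate names letters, ids, and values are mine):

import spark.implicits._  // required for .toDF outside the spark-shell

val letters = 'a' to 'd'                     // one DataFrame per letter
val dfList = letters.map { col =>
  val ids    = 1 to 5                        // the shared ID column
  val values = col.toInt to col.toInt + 4    // 'a'.toInt == 97, hence 97..101, 98..102, ...
  ids.zip(values).toDF("ID", col.toString)   // two columns: ID and the letter's name
}.toList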

Joining the DataFrames

scala> val joinedDF = dfList.tail.foldLeft(dfList.head)((accDF, newDF) => accDF.join(newDF, Seq("ID")))
joinedDF: org.apache.spark.sql.DataFrame = [ID: int, a: int ... 3 more fields]

scala> joinedDF.show
+---+---+---+---+---+
| ID|  a|  b|  c|  d|
+---+---+---+---+---+
|  1| 97| 98| 99|100|
|  2| 98| 99|100|101|
|  3| 99|100|101|102|
|  4|100|101|102|103|
|  5|101|102|103|104|
+---+---+---+---+---+

In Scala, a fold is a way of reducing a collection to a single element. In this case, we start with the head of the list (dfList.head) and join each element of the list's tail (dfList.tail) onto it to arrive at one final DataFrame. accDF is the accumulated DataFrame passed from "iteration" to "iteration", and newDF is the next (new) DataFrame to be joined in.
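
To make the evaluation order concrete, here is a minimal sketch of foldLeft on a plain Scala List; the DataFrame fold above has exactly the same shape, with dfList.head in place of 0 and join in place of +:

// Unrolls as ((0 + 1) + 2) + 3 = 6
val sum = List(1, 2, 3).foldLeft(0)((acc, next) => acc + next)

// The DataFrame version unrolls the same way:
// ((dfList.head join df2) join df3) join df4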

For more examples of how fold works, see here and here.

Answer 1 (score: 1)

@evan058 provided a working solution, but I'd add that reduce may be a better choice for parallelized operations:

val joinedDF = dfList.reduce((accDF, nextDF) => accDF.join(nextDF, Seq("ID")))