Spark DataFrames - an alternative to using cogroup for a full outer join

Asked: 2019-05-09 00:57:08

Tags: apache-spark apache-spark-sql

I'm trying to troubleshoot a performance issue in one of my Spark jobs, and I believe the problem is in my use of the `cogroup` function. I'm trying to combine two dataframes (both are quite large, so neither can be broadcast), and a simple join won't work because I need to add a lot of extra processing logic.

Here is an example of the two dataframes.

Transactions:

+---------+-----------------+--------+
| CardNum | TransactionTime | Amount |
+---------+-----------------+--------+
| ABC     |        20190101 |   10.0 |
| ABC     |        20180501 |   25.0 |
| DEF     |        20181201 |   30.0 |
| ghi     |        20180101 |   20.0 |
+---------+-----------------+--------+

Lookup:

+---------+------------+-----------------+------------------+-------------+
| CardID  | InternalId | RecordStartDate | RecordExpiryDate | AnotherCode |
+---------+------------+-----------------+------------------+-------------+
| abc     |      10001 | 2018-01-01      | 2018-05-20       | A           |
| def     |      10002 | 2018-01-01      | 9999-12-31       | A           |
| def     |      10005 | 2018-01-01      | 9999-12-31       | B           |
| ghi     |      10003 | 2018-01-01      | 9999-12-31       | B           |
| abc     |      20001 | 2018-05-20      | 9999-12-31       | A           |
+---------+------------+-----------------+------------------+-------------+

Expected result:

+---------+-----------------+--------+------------+--------------------------------------------------------------+
| CardNum | TransactionTime | Amount | InternalID |                    Additional Explanation                    |
+---------+-----------------+--------+------------+--------------------------------------------------------------+
| ABC     |        20190101 |   10.0 |      20001 | For this txn time, this internal id matches                  |
| ABC     |        20180501 |   25.0 |      10001 | For an older txn on the same card, the older id matches      |
| DEF     |        20181201 |   30.0 |      10002 | If two records are valid, pick the internal id with code "A" |
| ghi     |        20180101 |   20.0 |      10003 | Since there is only one match, keep the returned id          |
+---------+-----------------+--------+------------+--------------------------------------------------------------+
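The matching rules implied by the table above can be sketched in plain Scala, outside of Spark. This is a minimal illustration only: the `Txn` and `Lookup` case classes, the half-open validity window (`RecordStartDate` inclusive, `RecordExpiryDate` exclusive), and the `resolve` helper are all hypothetical stand-ins, not the asker's actual `resolveInternalId`.

```scala
// Hypothetical row shapes mirroring the two tables above.
case class Txn(cardNum: String, txnTime: Int, amount: Double)
case class Lookup(cardId: String, internalId: Int, start: String, expiry: String, code: String)

// Compare the yyyyMMdd transaction time against the yyyy-MM-dd record bounds
// by normalizing both sides to plain digit strings, then comparing lexically.
def inRange(txnTime: Int, start: String, expiry: String): Boolean = {
  val t = txnTime.toString
  val s = start.replace("-", "")
  val e = expiry.replace("-", "")
  s <= t && t < e  // assumption: expiry is exclusive, so back-to-back records don't overlap
}

// Rule 1: the record's validity window must cover the transaction time.
// Rule 2: if several records qualify, prefer the one with code "A".
def resolve(txn: Txn, candidates: Seq[Lookup]): Option[Int] = {
  val valid = candidates.filter(l => inRange(txn.txnTime, l.start, l.expiry))
  valid.sortBy(_.code).headOption.map(_.internalId)  // None = unresolved, i.e. an outlier
}
```

Returning `Option[Int]` here corresponds to the outlier case the question mentions: a `None` marks a transaction whose internal id could not be resolved.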

How I currently join the data:

// Conversion to lowercase is needed because the grouping must ignore case
val transactionsGroupedDF = transactionsDF.groupByKey(item => item.getAs[String]("CardNum").toLowerCase)
val lookupGroupedDF = lookupDF.groupByKey(item => item.getAs[String]("CardID").toLowerCase)


val resultDF = transactionsGroupedDF.cogroup(lookupGroupedDF) {
  case (key, iter1, iter2) =>
    val txnDataList = iter1.toList
    val lookupList = iter2.toList

    // resolveInternalId enriches each transaction row with the matching internal id
    txnDataList.map(item => resolveInternalId(item, lookupList, key))
}(RowEncoder(transactionsDF.schema.add("internalID", "string")))

I think I do need the cogroup here, because I really need the data from both sides. Specifically, the transactions have to be enriched with the correct internal id, which depends on the transaction date, on business rules around `AnotherCode`, and on being able to write out outliers whenever an internal id cannot be resolved.

I believe the code works as intended, but I'm worried that I'm not doing this transformation in the most optimal way. The multiple groupByKey calls concern me, and so does the call to cogroup, since I'm not 100% familiar with it. Any feedback would be greatly appreciated. Thanks!

0 Answers:

There are no answers yet