Question

The following code raises a "Detected cartesian product for INNER join" exception:

from pyspark.sql import functions as F

# Assumes an existing SparkSession named `spark`.
first_df = spark.createDataFrame([{"first_id": "1"}, {"first_id": "1"}, {"first_id": "1"}])
second_df = spark.createDataFrame([{"some_value": "????"}])

second_df = second_df.withColumn("second_id", F.lit("1"))

# If the next line is uncommented, then the JOIN is working fine.
# second_df.persist()

result_df = first_df.join(second_df,
                          first_df.first_id == second_df.second_id,
                          'inner')
data = result_df.collect()

result_df.explain()

and tells me that the logical plans are as follows:

Filter (first_id#0 = 1)
+- LogicalRDD [first_id#0], false
and
Project [some_value#2, 1 AS second_id#4]
+- LogicalRDD [some_value#2], false
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;

It seems that, for these logical plans, no columns are present in the JOIN condition by the time RuleExecutor applies the optimization rule called CheckCartesianProducts (see https://github.com/apache/spark/blob/v2.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1114).

However, if I call the "persist" method before the JOIN, it works and the physical plan is:

*(3) SortMergeJoin [first_id#0], [second_id#4], Inner
:- *(1) Sort [first_id#0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(first_id#0, 10)
:     +- Scan ExistingRDD[first_id#0]
+- *(2) Sort [second_id#4 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(second_id#4, 10)
      +- InMemoryTableScan [some_value#2, second_id#4]
            +- InMemoryRelation [some_value#2, second_id#4], true, 10000, StorageLevel(disk, memory, 1 replicas)
                  +- *(1) Project [some_value#2, 1 AS second_id#4]
                     +- Scan ExistingRDD[some_value#2]

So maybe someone can explain the internals that lead to this result, because persisting the data frame does not look like a real solution.

Answer 1

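The problem is that once you persist your data, second_id is incorporated into the cached table and is no longer treated as a constant. As a result, the planner can no longer infer that the query should be expressed as a Cartesian product, and it uses a standard SortMergeJoin on hash-partitioned second_id instead.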
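To make the inference concrete, here is a rough pure-Python illustration (not Spark's optimizer code) of why an inner join on a literal key is the same thing as a filter followed by a Cartesian product. The row lists mirror the data in the question; the helper names inner_join and filter_then_cross are made up for this sketch.

```python
from itertools import product

# Same data as in the question; second_id is the literal "1" on every row.
first_rows = [{"first_id": "1"}, {"first_id": "1"}, {"first_id": "1"}]
second_rows = [{"some_value": "????", "second_id": "1"}]


def inner_join(left, right):
    """Naive inner join on first_id == second_id."""
    return [{**l, **r} for l in left for r in right
            if l["first_id"] == r["second_id"]]


def filter_then_cross(left, right, constant):
    """What is left once the constant is folded into the condition:
    a filter on the left side, then a join with no condition at all."""
    kept = [l for l in left if l["first_id"] == constant]
    return [{**l, **r} for l, r in product(kept, right)]


joined = inner_join(first_rows, second_rows)
rewritten = filter_then_cross(first_rows, second_rows, "1")
print(joined == rewritten)
```

This matches the shape of the logical plans in the error message: a Filter on first_id on one side, a Project with the literal on the other, and no join condition left between them.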

It would be trivial to achieve the same outcome without persistence, using a udf:

from pyspark.sql.functions import lit, pandas_udf, PandasUDFType

# A pass-through UDF hides the literal from the optimizer.
@pandas_udf('integer', PandasUDFType.SCALAR)
def identity(x):
    return x

second_df = second_df.withColumn('second_id', identity(lit(1)))

result_df = first_df.join(second_df,
                          first_df.first_id == second_df.second_id,
                          'inner')
result_df.explain()

== Physical Plan ==
*(6) SortMergeJoin [cast(first_id#4 as int)], [second_id#129], Inner
:- *(2) Sort [cast(first_id#4 as int) ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(cast(first_id#4 as int), 200)
:     +- *(1) Filter isnotnull(first_id#4)
:        +- Scan ExistingRDD[first_id#4]
+- *(5) Sort [second_id#129 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(second_id#129, 200)
      +- *(4) Project [some_value#6, pythonUDF0#154 AS second_id#129]
         +- ArrowEvalPython [identity(1)], [some_value#6, pythonUDF0#154]
            +- *(3) Project [some_value#6]
               +- *(3) Filter isnotnull(pythonUDF0#153)
                  +- ArrowEvalPython [identity(1)], [some_value#6, pythonUDF0#153]
                     +- Scan ExistingRDD[some_value#6]

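However, SortMergeJoin is not what you should be trying to achieve here. With a constant key, it would result in extreme data skew and would likely fail on anything but toy data.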
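The skew can be illustrated without Spark. The sketch below assigns keys to 200 partitions (the number shown in the Exchange hashpartitioning step of the plan) using zlib.crc32 as a stand-in hash. Spark uses its own hash function, so the partition numbers will differ, but the effect of a constant key is the same: every row lands in a single partition.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 200  # as in "Exchange hashpartitioning(..., 200)"


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Stand-in for hash partitioning; NOT Spark's actual hash function."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


# 100,000 rows that all share the constant join key "1".
constant_keys = ["1"] * 100_000
# 100,000 rows with distinct keys, for comparison.
varied_keys = [str(i) for i in range(100_000)]

constant_load = Counter(partition_for(k) for k in constant_keys)
varied_load = Counter(partition_for(k) for k in varied_keys)

print("partitions used (constant key):", len(constant_load))
print("largest partition (constant key):", max(constant_load.values()))
print("partitions used (varied keys):", len(varied_load))
print("largest partition (varied keys):", max(varied_load.values()))
```

With the constant key, one task ends up sorting and joining all of the rows while the other partitions stay empty.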

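A Cartesian product, as expensive as it is, won't suffer from this issue and should be preferred here. So I would recommend enabling cross joins (spark.sql.crossJoin.enabled for Spark 2.x) or using the explicit cross join syntax, and moving on.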
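A minimal sketch of the explicit cross join route, assuming the DataFrames from the question. The helper name join_on_constant_key is hypothetical; it relies only on the standard DataFrame.filter and DataFrame.crossJoin methods, so the Cartesian product is requested explicitly rather than inferred by the optimizer.

```python
def join_on_constant_key(left_df, right_df, left_col, key_value):
    """Hypothetical helper: equivalent of an inner join in which the
    right-hand key is the literal `key_value` on every row.

    Keeps only the left rows whose `left_col` equals the constant, then
    pairs each of them with every right row via an explicit cross join.
    """
    matching_left = left_df.filter(left_df[left_col] == key_value)
    return matching_left.crossJoin(right_df)
```

With the data from the question this would be called as result_df = join_on_constant_key(first_df, second_df, "first_id", "1"). The alternative mentioned above is to leave the original join untouched and set spark.sql.crossJoin.enabled to true in the Spark 2.x session configuration.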

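An open question remains: how to prevent the undesired behavior when the data is cached. Unfortunately, I don't have an answer ready for that. I am fairly sure it is possible with a custom optimizer rule, but that is not something that can be done with Python alone.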

Detected cartesian product for INNER join on literal column in PySpark
