Unable to perform a broadcast join when reading from a Hive parquet table

Date: 2019-07-18 10:34:35

Tags: scala apache-spark parquet spark-shell

I want to perform a broadcast join between two Hive tables. One holds roughly 300-400 MB of data, the other about 1 MB. I want to broadcast the small table.

When I read the tables with spark.read.table("tableA"), the explain output shows a SortMergeJoin. However, when I read them with spark.read.parquet("tableALocation"), it shows a BroadcastHashJoin.
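For context, Spark only plans an automatic broadcast when its size estimate for one side of the join falls below spark.sql.autoBroadcastJoinThreshold. A quick way to check the active value in the shell (10485760 bytes, i.e. 10 MB, is the Spark default; a cluster may override it):

scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
res0: String = 10485760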

Joining the tables read via spark.read.table results in a SortMergeJoin:

scala> val smallTable = spark.read.table("test.smallTable")
smallTable: org.apache.spark.sql.DataFrame = [x_col_a: double, x_col_b: double ... 49 more fields]

scala> val bigtable = spark.read.table("test.bigtable")
bigtable: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 109 more fields]

scala> val joinTable = bigtable.join(smallTable,bigtable("quantile_") === smallTable("quantile_"),"inner")
joinTable: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 160 more fields]

scala> joinTable.explain
== Physical Plan ==
SortMergeJoin [quantile_#7502], [quantile_#7397], Inner
:- Sort [quantile_#7502 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(quantile_#7502, 200)

However, if I read the parquet files directly, the join is automatically broadcast:

scala> val smallTableFile = spark.read.parquet("/apps/hive/warehouse/test.db/test_1_crosssellaggregatecomponent")
smallTableFile: org.apache.spark.sql.DataFrame = [x_col_a: double, x_col_b: double ... 49 more fields]


scala> val bigTableFile = spark.read.parquet("/apps/hive/warehouse/test.db/test_1_prevcurrentlateststringinputjoincomponent")
bigTableFile: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 109 more fields]


scala> val join = bigTableFile.join(smallTableFile,smallTableFile("quantile_") === bigTableFile("quantile_"),"inner")
join: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 160 more fields]

scala> join.explain
== Physical Plan ==
BroadcastHashJoin [quantile_#8562], [quantile_#8508], Inner, BuildRight
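To see what the planner is working with, the size estimate for each side can be inspected directly; a minimal sketch, assuming Spark 2.3+ where LogicalPlan.stats takes no arguments (on 2.2 it needs the SQL conf passed in):

scala> smallTableFile.queryExecution.optimizedPlan.stats.sizeInBytes

For the file-based read this estimate comes from the parquet files' on-disk footprint, so the ~1 MB table lands under the threshold. Running the same call on the DataFrame read via spark.read.table shows the estimate coming from the Hive catalog, which is the number I don't understand.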

I also observed that the join is automatically broadcast if I persist the smallTable:


scala> val smallTable = spark.read.table("test.smallTable")
smallTable: org.apache.spark.sql.DataFrame = [x_col_a: double, x_col_b: double ... 49 more fields]

scala> val bigTable = spark.read.table("test.bigTable")
bigTable: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 109 more fields]

scala> val join = bigTable.join(smallTable,smallTable("quantile_") === bigTable("quantile_"),"inner")
join: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 160 more fields]

scala> join.explain
19/07/18 10:30:36 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
== Physical Plan ==
SortMergeJoin [quantile_#154], [quantile_#49], Inner
:- Sort [quantile_#154 ASC NULLS FIRST], false, 0

scala> smallTable.persist
res1: smallTable.type = [x_col_a: double, x_col_b: double ... 49 more fields]

scala> smallTable.count
res2: Long = 10

scala> join.explain
== Physical Plan ==
BroadcastHashJoin [quantile_#154], [quantile_#49], Inner, BuildRight
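My working guess is that persisting replaces the scan with an InMemoryRelation whose size is computed from the materialized data, which drops the estimate under the threshold. The same stats check as above should show it (again, a sketch assuming Spark 2.3+):

scala> smallTable.queryExecution.optimizedPlan.stats.sizeInBytes  // now reflects the materialized in-memory size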

I know we can force the broadcast with sql.functions.broadcast. What I want to understand is why the automatic broadcast happens when reading the parquet files directly, but not when reading the same data through the Hive table.
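For reference, this is the forced variant I mean (using the broadcast hint from org.apache.spark.sql.functions, which makes the planner pick a BroadcastHashJoin regardless of its size estimate):

scala> import org.apache.spark.sql.functions.broadcast
scala> val forcedJoin = bigTable.join(broadcast(smallTable), smallTable("quantile_") === bigTable("quantile_"), "inner")
scala> forcedJoin.explain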

0 Answers:

No answers yet.