Unable to perform a broadcast join when reading from a Hive parquet table

Date: 2019-07-18 10:34:35

Tags: scala apache-spark parquet spark-shell

I want to perform a broadcast join between two Hive tables. One holds roughly 300-400 MB of data, the other about 1 MB. I want to broadcast the small table.

When I read the tables with spark.read.table("tableA"), the explain output shows a SortMergeJoin. However, when I read them with spark.read.parquet("tableALocation"), it shows a BroadcastHashJoin.
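For context, Spark only plans an automatic broadcast when its size estimate for one side of the join falls below spark.sql.autoBroadcastJoinThreshold. A quick way to check the active value in the shell (10485760 bytes, i.e. 10 MB, is the Spark default; a cluster may override it):

scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
res0: String = 10485760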

Joining the tables read via spark.read.table results in a SortMergeJoin:

scala> val smallTable = spark.read.table("test.smallTable")
smallTable: org.apache.spark.sql.DataFrame = [x_col_a: double, x_col_b: double ... 49 more fields]

scala> val bigtable = spark.read.table("test.bigtable")
bigtable: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 109 more fields]

scala> val joinTable = bigtable.join(smallTable,bigtable("quantile_") === smallTable("quantile_"),"inner")
joinTable: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 160 more fields]

scala> joinTable.explain
== Physical Plan ==
SortMergeJoin [quantile_#7502], [quantile_#7397], Inner
:- Sort [quantile_#7502 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(quantile_#7502, 200)

However, if I read the parquet files directly, the join is automatically broadcast:

scala> val smallTableFile = spark.read.parquet("/apps/hive/warehouse/test.db/test_1_crosssellaggregatecomponent")
smallTableFile: org.apache.spark.sql.DataFrame = [x_col_a: double, x_col_b: double ... 49 more fields]


scala> val bigTableFile = spark.read.parquet("/apps/hive/warehouse/test.db/test_1_prevcurrentlateststringinputjoincomponent")
bigTableFile: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 109 more fields]


scala> val join = bigTableFile.join(smallTableFile,smallTableFile("quantile_") === bigTableFile("quantile_"),"inner")
join: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 160 more fields]

scala> join.explain
== Physical Plan ==
BroadcastHashJoin [quantile_#8562], [quantile_#8508], Inner, BuildRight
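To see what the planner is working with, the size estimate for each side can be inspected directly; a minimal sketch, assuming Spark 2.3+ where LogicalPlan.stats takes no arguments (on 2.2 it needs the SQL conf passed in):

scala> smallTableFile.queryExecution.optimizedPlan.stats.sizeInBytes

For the file-based read this estimate comes from the parquet files' on-disk footprint, so the ~1 MB table lands under the threshold. Running the same call on the DataFrame read via spark.read.table shows the estimate coming from the Hive catalog, which is the number I don't understand.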

I also observed that the join is automatically broadcast if I persist the smallTable:


scala> val smallTable = spark.read.table("test.smallTable")
smallTable: org.apache.spark.sql.DataFrame = [x_col_a: double, x_col_b: double ... 49 more fields]

scala> val bigTable = spark.read.table("test.bigTable")
bigTable: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 109 more fields]

scala> val join = bigTable.join(smallTable,smallTable("quantile_") === bigTable("quantile_"),"inner")
join: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 160 more fields]

scala> join.explain
19/07/18 10:30:36 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
== Physical Plan ==
SortMergeJoin [quantile_#154], [quantile_#49], Inner
:- Sort [quantile_#154 ASC NULLS FIRST], false, 0

scala> smallTable.persist
res1: smallTable.type = [x_col_a: double, x_col_b: double ... 49 more fields]

scala> smallTable.count
res2: Long = 10

scala> join.explain
== Physical Plan ==
BroadcastHashJoin [quantile_#154], [quantile_#49], Inner, BuildRight
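My working guess is that persisting replaces the scan with an InMemoryRelation whose size is computed from the materialized data, which drops the estimate under the threshold. The same stats check as above should show it (again, a sketch assuming Spark 2.3+):

scala> smallTable.queryExecution.optimizedPlan.stats.sizeInBytes  // now reflects the materialized in-memory size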

I know we can force the broadcast with sql.functions.broadcast. What I want to understand is why the automatic broadcast happens when reading the parquet files directly, but not when reading the same data through the Hive table.
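For reference, this is the forced variant I mean (using the broadcast hint from org.apache.spark.sql.functions, which makes the planner pick a BroadcastHashJoin regardless of its size estimate):

scala> import org.apache.spark.sql.functions.broadcast
scala> val forcedJoin = bigTable.join(broadcast(smallTable), smallTable("quantile_") === bigTable("quantile_"), "inner")
scala> forcedJoin.explain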

0 Answers:

No answers yet.