What is an "ExistingRDD"? Is it bad for the query plan?

Time: 2019-04-24 20:05:10

Tags: apache-spark pyspark apache-spark-sql rdd pyspark-sql

From what I can see, rdd.toDF() introduces a PythonRDD, which shows up as an ExistingRDD in the query plan:

df1 = spark.range(100, numPartitions=5)
df2 = df1.rdd.toDF()

print(df1.rdd.toDebugString())
# (5) MapPartitionsRDD[2097] at javaToPython at <unknown>:0 []
#  |  MapPartitionsRDD[2096] at javaToPython at <unknown>:0 []
#  |  MapPartitionsRDD[2095] at javaToPython at <unknown>:0 []
#  |  MapPartitionsRDD[2094] at javaToPython at <unknown>:0 []
#  |  ParallelCollectionRDD[2093] at javaToPython at <unknown>:0 []
print(df2.rdd.toDebugString())
# (5) MapPartitionsRDD[2132] at javaToPython at <unknown>:0 []
#  |  MapPartitionsRDD[2131] at javaToPython at <unknown>:0 []
#  |  MapPartitionsRDD[2130] at javaToPython at <unknown>:0 []
#  |  MapPartitionsRDD[2129] at applySchemaToPythonRDD at <unknown>:0 []
#  |  MapPartitionsRDD[2128] at map at SerDeUtil.scala:137 []
#  |  MapPartitionsRDD[2127] at mapPartitions at SerDeUtil.scala:184 []
#  |  PythonRDD[2126] at RDD at PythonRDD.scala:53 []
#  |  MapPartitionsRDD[2097] at javaToPython at <unknown>:0 []
#  |  MapPartitionsRDD[2096] at javaToPython at <unknown>:0 []
#  |  MapPartitionsRDD[2095] at javaToPython at <unknown>:0 []
#  |  MapPartitionsRDD[2094] at javaToPython at <unknown>:0 []
#  |  ParallelCollectionRDD[2093] at javaToPython at <unknown>:0 []

If I cache the DataFrame with df1.cache(), Spark SQL is smart enough to reuse the cached data for an equivalent query:
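(For completeness, the plan below assumes the cache has been set up, roughly like this:)

df1.cache()   # mark df1 for caching; equivalent plans can then pick up the InMemoryRelation
df1.count()   # optional: actually populate the cache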

spark.range(100, numPartitions=5).groupby().count().explain()
# == Physical Plan ==
# *(2) HashAggregate(keys=[], functions=[count(1)])
# +- Exchange SinglePartition
#    +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
#       +- *(1) InMemoryTableScan
#             +- InMemoryRelation [id#2525L], StorageLevel(disk, memory, deserialized, 1 replicas)
#                   +- *(1) Range (0, 100, step=1, splits=5)

However, the ExistingRDD does not benefit from it:

df2.groupby().count().explain()
# == Physical Plan ==
# *(2) HashAggregate(keys=[], functions=[count(1)])
# +- Exchange SinglePartition
#    +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
#       +- *(1) Project
#          +- Scan ExistingRDD[id#2573L]

It seems the Spark SQL optimizer cannot trace the lineage through an ExistingRDD. Is that true?

If I instead use df1.rdd.cache().count(), can df2 still benefit from the RDD cache, given that df2.rdd is a descendant of df1.rdd?
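This is roughly how I imagine checking it (a sketch only, assuming df1.rdd returns the same memoized RDD object each time):

df1.rdd.cache()
df1.rdd.count()                  # materialize the RDD cache
print(df2.rdd.toDebugString())   # check whether the lineage now reports cached partitions
df2.groupby().count().explain()  # check whether the physical plan changes at all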

I would also like to know which operations produce an ExistingRDD, since it seems to act as a barrier in the query plan and could therefore hurt performance. A sketch of how I would probe this follows.
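As a sketch (my own guess about what to look for, not verified), one could try a few constructions and check whether Scan ExistingRDD appears in the physical plan:

from pyspark.sql import Row

spark.createDataFrame([Row(id=i) for i in range(10)]).explain()   # DataFrame built from local Python objects
spark.range(10).rdd.map(lambda r: r).toDF().explain()             # round trip through a Python RDD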

0 Answers:

There are no answers yet.