Why does filtering on a non-existent (non-selected) column work?

Asked: 2020-01-05 06:54:46

Tags: scala apache-spark

The following minimal example

import spark.implicits._                   // implicit in spark-shell; needed in a standalone app
import org.apache.spark.sql.functions.lit
val df1 = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("foo", "bar")
val df2 = df1.select($"foo")               // bar is projected away here
val df3 = df2.filter($"bar" === lit("a"))  // yet filtering on bar still works

df1.printSchema
df1.show

df2.printSchema
df2.show

df3.printSchema
df3.show

runs without errors:

root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)

+---+---+
|foo|bar|
+---+---+
|  0|  a|
|  1|  b|
+---+---+

root
 |-- foo: integer (nullable = false)

+---+
|foo|
+---+
|  0|
|  1|
+---+

root
 |-- foo: integer (nullable = false)

+---+
|foo|
+---+
|  0|
+---+

However, I would expect something like

org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given input columns: [foo];

for the same reason that I get

org.apache.spark.sql.AnalysisException: cannot resolve '`asdasd`' given input columns: [foo];

when I do

val df4 = df2.filter($"asdasd" === lit("a"))

But that does not happen. Why?

1 Answer:

Answer 0 (score: 2)

I would be inclined to call it a bug. An explain plan tells a little more:

val df1 = Seq((0, "a"), (1, "b")).toDF("foo", "bar")

df1.select("foo").where($"bar" === "a").explain(true)
// == Parsed Logical Plan ==
// 'Filter ('bar = a)
// +- Project [foo#4]
//    +- Project [_1#0 AS foo#4, _2#1 AS bar#5]
//       +- LocalRelation [_1#0, _2#1]
// 
// == Analyzed Logical Plan ==
// foo: int
// Project [foo#4]
// +- Filter (bar#5 = a)
//    +- Project [foo#4, bar#5]
//       +- Project [_1#0 AS foo#4, _2#1 AS bar#5]
//          +- LocalRelation [_1#0, _2#1]
// 
// == Optimized Logical Plan ==
// LocalRelation [foo#4]
// 
// == Physical Plan ==
// LocalTableScan [foo#4]

Apparently, both the parsed logical plan and the analyzed (or resolved) logical plan still consist of bar in their Project nodes (i.e. the projections): in the analyzed plan above, the analyzer has resolved bar#5 from the underlying relation, slipped the Filter beneath a widened Project [foo#4, bar#5], and only then projects bar away again. So the filtering operation continues to honor the supposedly removed column.
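
If you want the eager failure the question expects, one workaround (a sketch of mine, not something the original answer proposed) is to resolve the column against the DataFrame's own schema rather than through an unresolved $-column; df2("bar") is shorthand for df2.col("bar"), which looks the name up at call time:

df2.filter(df2("bar") === lit("a"))
// fails immediately, since bar is not in df2's schema, with something like:
// org.apache.spark.sql.AnalysisException: Cannot resolve column name "bar" among (foo)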

On a related note, a query that removes the column with drop instead of select yields logical plans that likewise contain the dropped column, and thus exhibits a similar anomaly:

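df1.drop("bar").where($"bar" === "a")

This snippet is an illustration consistent with the plans above: running explain(true) on it should show the analyzer re-adding bar#5 beneath the Filter and projecting it away again at the top.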