Question

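I always thought that the Dataset and DataFrame APIs were the same, and that the only difference was that the Dataset API gives you compile-time safety. Right?

So, my case is really simple: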

 case class Player (playerID: String, birthYear: Int)

 val playersDs: Dataset[Player] = session.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .csv(PeopleCsv)
  .as[Player]

 // Let's try to find players born in 1999. 
 // This will work, you have compile time safety... but it will not use predicate pushdown!!!
 playersDs.filter(_.birthYear == 1999).explain()

 // This will work as expected and use predicate pushdown!!!
 // But you can't have compile time safety with this :(
 playersDs.filter('birthYear === 1999).explain()

The explain output for the first example shows that it is NOT doing predicate pushdown (notice the empty PushedFilters):

== Physical Plan ==
*(1) Filter <function1>.apply
+- *(1) FileScan csv [...] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:People.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<playerID:string,birthYear:int,birthMonth:int,birthDay:int,birthCountry:string,birthState:s...

The second sample, by contrast, does it correctly (notice PushedFilters):

== Physical Plan ==
*(1) Project [.....]
+- *(1) Filter (isnotnull(birthYear#11) && (birthYear#11 = 1999))
   +- *(1) FileScan csv [...] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:People.csv], PartitionFilters: [], PushedFilters: [IsNotNull(birthYear), EqualTo(birthYear,1999)], ReadSchema: struct<playerID:string,birthYear:int,birthMonth:int,birthDay:int,birthCountry:string,birthState:s...

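So the question is: how can I use the Dataset API, keep compile-time safety, and still have predicate pushdown work as expected?

Is it possible? If not, does this mean the Dataset API gives you compile-time safety but at the cost of performance? (In that case the DataFrame API would be much faster, especially when processing large Parquet files.)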

Answer 1

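This is the line in your physical plan that you should remember in order to understand the real difference between Dataset[T] and DataFrame (which is Dataset[Row]):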

Filter <function1>.apply

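I keep saying that people should stay away from the typed Dataset API and keep using the untyped DataFrame API, because Scala code becomes a black box to the optimizer in too many places. You have just hit one of them. Think also about the deserialization of all the objects that Spark SQL keeps away from the JVM to avoid GCs. Every time you touch the objects, you literally ask Spark SQL to deserialize them and load them onto the JVM, which puts a lot of pressure on the GC (it will be triggered more often with the typed Dataset API than with the untyped DataFrame API).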
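The deserialization cost described above can be made visible in a physical plan. The following is a minimal sketch, not taken from the original answer: the object name `TypedVsUntypedPlan`, the helper `planOf`, the local session named `spark`, and the two sample rows are all assumptions made for illustration. It compares a typed `map` (a Scala lambda over `Player` objects) with an equivalent untyped `select` (a Column expression).

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical names throughout; only the Player case class comes from the question.
object TypedVsUntypedPlan {
  case class Player(playerID: String, birthYear: Int)

  // Renders the physical plan of any Dataset as a string.
  def planOf(ds: Dataset[_]): String = ds.queryExecution.executedPlan.toString

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, not the asker's `session`.
    val spark = SparkSession.builder().master("local[1]").appName("plan-sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample rows.
    val players = Seq(Player("a01", 1999), Player("b02", 1985)).toDS()

    // Typed: the lambda needs real Player objects, so rows must be deserialized first.
    val typed = players.map(_.birthYear + 1)

    // Untyped: a Column expression that Catalyst can evaluate on its internal rows.
    val untyped = players.select(($"birthYear" + 1).as("value"))

    println(planOf(typed))
    println(planOf(untyped))

    spark.stop()
  }
}
```

In the typed plan, look for the `DeserializeToObject`, `MapElements` and `SerializeFromObject` operators wrapped around the lambda; the untyped plan should not contain them.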

See UDFs are Blackbox — Don’t Use Them Unless You’ve Got No Choice.

Quoting Reynold Xin, after I asked the very same question on the dev@spark.a.o mailing list:

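The UDF is a black box so Spark can't know what it is dealing with. There are simple cases in which we can analyze the UDF's byte code and infer what it is doing, but it is pretty difficult to do in general.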

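There is a JIRA ticket for such cases, SPARK-14083 Analyze JVM bytecode and turn closures into Catalyst expressions, but as someone said (I think it was Adam B. on Twitter), expecting it any time soon would be something of a joke.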

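One big advantage of the Dataset API is the type safety, at the cost of performance due to heavy reliance on user-defined closures/lambdas. These closures are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc). In many cases, it's actually not going to be very difficult to look into the byte code of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized executions.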

// Let's try to find players born in 1999. 
// This will work, you have compile time safety... but it will not use predicate pushdown!!!
playersDs.filter(_.birthYear == 1999).explain()

The above code is equivalent to the following:

val someCodeSparkSQLCannotDoMuchOutOfIt = (p: Player) => p.birthYear == 1999
playersDs.filter(someCodeSparkSQLCannotDoMuchOutOfIt).explain()

someCodeSparkSQLCannotDoMuchOutOfIt is exactly where you put optimizations aside and let the Spark optimizer skip it.
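One possible middle ground, offered here as a sketch rather than as the answer's own recommendation: `Dataset.filter` also has an overload that takes a `Column`, and it returns `Dataset[T]`, not a `DataFrame`. The predicate is then a Catalyst expression the optimizer can inspect, and the result stays typed as `Dataset[Player]`; what is lost is compile-time checking of the column name itself. The object name `TypedResultWithPushdown`, the helper `bornIn`, the temporary Parquet path and the sample rows are assumptions for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical names throughout; only the Player case class comes from the question.
object TypedResultWithPushdown {
  case class Player(playerID: String, birthYear: Int)

  // The predicate is a Column expression, yet the result is still Dataset[Player].
  // The column name is a plain string, so a typo only fails at analysis time.
  def bornIn(players: Dataset[Player], year: Int): Dataset[Player] =
    players.filter(col("birthYear") === year)

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, not the asker's `session`.
    val spark = SparkSession.builder().master("local[1]").appName("pushdown-sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample rows written to a temporary Parquet directory.
    val path = Files.createTempDirectory("players").resolve("data").toString
    Seq(Player("a01", 1999), Player("b02", 1985)).toDS().write.parquet(path)

    val playersDs: Dataset[Player] = spark.read.parquet(path).as[Player]

    // Inspect the PushedFilters entry of the file scan in the printed plan.
    bornIn(playersDs, 1999).explain()

    spark.stop()
  }
}
```

With a file source that supports filter pushdown, the `PushedFilters` entry of the scan would normally list the comparison, as in the second plan from the question. Later typed operations on the result still go through the object path.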

Why is predicate pushdown not used in the typed Dataset API (vs. the untyped DataFrame API)?