RDD Lineage Caching

Date: 2016-01-25 22:14:20

Tags: apache-spark rdd

I'm having trouble understanding lineage in the case of RDDs. For example,

let's say we have this lineage:

hadoopRDD(location) <-depends- filteredRDD(f:A->Boolean) <-depends- mappedRDD(f:A->B)

If we persist the first RDD, and after some actions we unpersist it, will this affect the other RDDs that depend on it? If so, how can this be avoided?

My point is: if we unpersist the parent RDD, will this action remove the partitions from the child RDDs?
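The persist/unpersist behavior being asked about can be sketched with a toy lineage model in plain Scala. This is NOT Spark's implementation and all names here (`Node`, `computeCount`) are made up for illustration; it only mimics the idea that an action walks the lineage and recomputes any parent whose result is not cached:

```scala
// A toy sketch of lazy lineage in plain Scala -- NOT Spark internals.
// Each node either serves its cached result or recomputes through its
// parent chain, mirroring how an action walks an RDD's lineage.
class Node[A](body: () => Seq[A]) {
  private var cache: Option[Seq[A]] = None
  private var persisted = false
  var computeCount = 0                      // times this node's body has run

  def persist(): this.type = { persisted = true; this }
  def unpersist(): this.type = { persisted = false; cache = None; this }

  def map[B](f: A => B): Node[B] = new Node(() => collect().map(f))

  def collect(): Seq[A] = cache.getOrElse {
    computeCount += 1
    val data = body()
    if (persisted) cache = Some(data)       // only persisted nodes keep data
    data
  }
}

val source = new Node(() => Seq(1, 2, 3, 4, 5))
val first  = source.map(_ * 2).persist()    // analogue of persisting the parent
val second = first.map(_ * 2)               // child keeps its lineage either way

second.collect()                            // parent computed once, then cached
second.collect()                            // parent served from its cache
assert(first.computeCount == 1)

first.unpersist()                           // drops only the parent's cache
second.collect()                            // child recomputes through the parent
assert(first.computeCount == 2)
```

In this sketch, unpersisting the parent removes nothing from the child; it only means the next action on the child has to recompute the parent. That matches the behavior demonstrated in the answer below.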

1 answer:

Answer 0: (score: 0)

Let's work through an example. This creates an RDD holding a Seq of Ints in a single partition. The single partition is only there to keep the output ordered for the rest of the example.

scala> val seq = Seq(1,2,3,4,5)
seq: Seq[Int] = List(1, 2, 3, 4, 5)

scala> val rdd = sc.parallelize(seq, 1)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:23

Now let's create two new RDDs that are mapped versions of the original:

scala> val firstMappedRDD = rdd.map { case i => println(s"firstMappedRDD  calc for $i"); i * 2 }
firstMappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[12] at map at <console>:25

scala> firstMappedRDD.toDebugString
res25: String = 
(1) MapPartitionsRDD[12] at map at <console>:25 []
 |  ParallelCollectionRDD[11] at parallelize at <console>:23 []

scala> val secondMappedRDD = firstMappedRDD.map { case i => println(s"secondMappedRDD calc for $i"); i * 2 }
secondMappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at map at <console>:27

scala> secondMappedRDD.toDebugString
res26: String = 
(1) MapPartitionsRDD[13] at map at <console>:27 []
 |  MapPartitionsRDD[12] at map at <console>:25 []
 |  ParallelCollectionRDD[11] at parallelize at <console>:23 []

We can view the lineage with toDebugString. I added a println to each map step so it's clear when the map function is actually invoked. Let's collect each RDD and see what happens:

scala> firstMappedRDD.collect()
firstMappedRDD  calc for 1
firstMappedRDD  calc for 2
firstMappedRDD  calc for 3
firstMappedRDD  calc for 4
firstMappedRDD  calc for 5
res27: Array[Int] = Array(2, 4, 6, 8, 10)

scala> secondMappedRDD.collect()
firstMappedRDD  calc for 1
secondMappedRDD calc for 2
firstMappedRDD  calc for 2
secondMappedRDD calc for 4
firstMappedRDD  calc for 3
secondMappedRDD calc for 6
firstMappedRDD  calc for 4
secondMappedRDD calc for 8
firstMappedRDD  calc for 5
secondMappedRDD calc for 10
res28: Array[Int] = Array(4, 8, 12, 16, 20)

As you'd expect, the map from the first step is called again when we call secondMappedRDD.collect(). So now let's cache the first mapped RDD.

scala> firstMappedRDD.cache()
res29: firstMappedRDD.type = MapPartitionsRDD[12] at map at <console>:25

scala> secondMappedRDD.toDebugString
res31: String = 
(1) MapPartitionsRDD[13] at map at <console>:27 []
 |  MapPartitionsRDD[12] at map at <console>:25 []
 |  ParallelCollectionRDD[11] at parallelize at <console>:23 []

scala> firstMappedRDD.count()
firstMappedRDD  calc for 1
firstMappedRDD  calc for 2
firstMappedRDD  calc for 3
firstMappedRDD  calc for 4
firstMappedRDD  calc for 5
res32: Long = 5

scala> secondMappedRDD.toDebugString
res33: String = 
(1) MapPartitionsRDD[13] at map at <console>:27 []
 |  MapPartitionsRDD[12] at map at <console>:25 []
 |      CachedPartitions: 1; MemorySize: 120.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 |  ParallelCollectionRDD[11] at parallelize at <console>:23 []

After the results of the first map are in the cache, the lineage of the second mapped RDD shows the cached result of its parent. Now let's call collect:

scala> secondMappedRDD.collect
secondMappedRDD calc for 2
secondMappedRDD calc for 4
secondMappedRDD calc for 6
secondMappedRDD calc for 8
secondMappedRDD calc for 10
res34: Array[Int] = Array(4, 8, 12, 16, 20)

Now let's unpersist and call collect again:

scala> firstMappedRDD.unpersist()
res36: firstMappedRDD.type = MapPartitionsRDD[12] at map at <console>:25

scala> secondMappedRDD.toDebugString
res37: String = 
(1) MapPartitionsRDD[13] at map at <console>:27 []
 |  MapPartitionsRDD[12] at map at <console>:25 []
 |  ParallelCollectionRDD[11] at parallelize at <console>:23 []

scala> secondMappedRDD.collect
firstMappedRDD  calc for 1
secondMappedRDD calc for 2
firstMappedRDD  calc for 2
secondMappedRDD calc for 4
firstMappedRDD  calc for 3
secondMappedRDD calc for 6
firstMappedRDD  calc for 4
secondMappedRDD calc for 8
firstMappedRDD  calc for 5
secondMappedRDD calc for 10
res38: Array[Int] = Array(4, 8, 12, 16, 20)

So when we collect the second mapped RDD after the results of the first mapped RDD have been unpersisted, the first map step is called again.

If the source were HDFS or any other storage, the data would be retrieved from the source again.
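That re-read can be sketched in plain Scala with a counter standing in for the HDFS read. All names here are hypothetical toy stand-ins, not Spark API:

```scala
// Toy sketch, not Spark: a counter stands in for reads from the source.
var sourceReads = 0
def readSource(): Seq[String] = { sourceReads += 1; Seq("a", "", "b") }

// Without caching, every "action" re-runs the whole lineage,
// including the source read.
def filteredUncached(): Seq[String] = readSource().filter(_.nonEmpty)
filteredUncached()
filteredUncached()
assert(sourceReads == 2)        // one source read per action

// With a materialized intermediate result, later actions reuse it
// instead of going back to the source.
lazy val filteredCached: Seq[String] = readSource().filter(_.nonEmpty)
filteredCached
filteredCached
assert(sourceReads == 3)        // only one more read for both uses
```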