Question

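I run a long sequence of iterations in my program, and I want to cache and checkpoint every few iterations so that I don't get a StackOverflowError (this technique is suggested on the web as a way to cut very long lineages). I do it like this: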

for (i <- 2 to 100) {
      // cache and checkpoint every 30 iterations
      if (i % 30 == 0) {
        graph.cache()
        graph.checkpoint()
        // I call numEdges (an action) to trigger the transformations I need
        graph.numEdges
      }
      // graphs are stored in a list
      // here I take the graph from the previous iteration
      // and perform a transformation on it
}

I have set the checkpoint directory like this:

val sc = new SparkContext(conf)
sc.setCheckpointDir("checkpoints/")

However, when I finally run my program, I get an exception:

Exception in thread "main" org.apache.spark.SparkException: Invalid checkpoint directory

I use 3 computers, each running Ubuntu 14.04, and on each of them I use a pre-built version of Spark 1.4.1 for Hadoop 2.4 or later.

Answer 1

If you have already set up HDFS on your cluster of nodes, you can find your HDFS address in "core-site.xml", located in the directory HADOOP_HOME/etc/hadoop. For me, core-site.xml is set up as:

<configuration>
      <property>
           <name>fs.default.name</name>
           <value>hdfs://master:9000</value>
      </property>
</configuration>
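
If you would rather read that address programmatically than copy it by hand, here is a minimal sketch using only the Python standard library. It assumes the file has the layout shown above. The names `default_fs` and `checkpoint_uri` are hypothetical helpers, not part of Spark or Hadoop. The sketch also looks up `fs.defaultFS`, the key newer Hadoop releases use in place of `fs.default.name`.

```python
import xml.etree.ElementTree as ET

# Sample contents mirroring the core-site.xml shown above.
CORE_SITE_XML = """
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
"""


def default_fs(core_site_xml):
    """Return the default filesystem URI from core-site.xml text, or None."""
    root = ET.fromstring(core_site_xml)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    # fs.defaultFS is the newer key; fs.default.name is the older one.
    return props.get("fs.defaultFS") or props.get("fs.default.name")


def checkpoint_uri(fs_uri, directory):
    """Join a filesystem URI and a directory into one fully qualified URI."""
    return fs_uri.rstrip("/") + "/" + directory.strip("/")


if __name__ == "__main__":
    print(checkpoint_uri(default_fs(CORE_SITE_XML), "/RddCheckPoint"))
```

In practice you would pass the text of your own HADOOP_HOME/etc/hadoop/core-site.xml instead of the sample string.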

Then you can create a directory on HDFS to hold the RDD checkpoint files. Let's name this directory RddCheckPoint and create it with the hadoop hdfs shell:

$ hadoop fs -mkdir /RddCheckPoint

If you use pyspark, then after the SparkContext has been initialized with sc = SparkContext(conf=conf), you can set the checkpoint directory with:

sc.setCheckpointDir("hdfs://master:9000/RddCheckPoint")

When an RDD is checkpointed, its checkpoint files are saved in the HDFS directory RddCheckPoint. To have a look:

$ hadoop fs -ls /RddCheckPoint
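
Files only show up there once an RDD has been marked with checkpoint() and an action has run on it afterwards. As a rough sketch of the "every N iterations" pattern from the question, the helper below relies only on the PySpark RDD methods cache(), checkpoint() and count(). The function name `iterate_with_checkpoints` and the `step` callback are hypothetical, and the question's GraphX graph is replaced by a plain RDD here, since this answer uses pyspark.

```python
def iterate_with_checkpoints(dataset, step, num_iterations, interval=30):
    """Apply `step` repeatedly, checkpointing every `interval` iterations.

    `dataset` is expected to behave like a PySpark RDD, i.e. to provide
    cache(), checkpoint() and count(). `step` takes a dataset and returns
    the transformed dataset for the next iteration.
    """
    for i in range(1, num_iterations + 1):
        dataset = step(dataset)
        if i % interval == 0:
            dataset.cache()       # keep the data around so it is not recomputed
            dataset.checkpoint()  # mark for checkpointing, before any action
            dataset.count()       # an action, which triggers the checkpoint write
    return dataset
```

With a live SparkContext whose checkpoint directory is already set, a call might look like iterate_with_checkpoints(sc.parallelize(range(10)), lambda r: r.map(lambda x: x + 1), 100).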

Answer 2

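The checkpoint directory needs to be an HDFS-compatible directory (from the Scala doc: "HDFS-compatible directory where the checkpoint data will be reliably stored. Note that this must be a fault-tolerant file system like HDFS"). So if you have HDFS set up on those nodes, point it to "hdfs://[yourcheckpointdirectory]".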
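
One way to catch a relative path such as "checkpoints/" early is a small guard that runs before the directory is handed to Spark. This is only a sketch: `require_scheme` is a hypothetical helper, and restricting the allowed schemes to "hdfs" by default is an assumption you may want to widen for other HDFS-compatible file systems.

```python
from urllib.parse import urlparse


def require_scheme(path, allowed=("hdfs",)):
    """Return `path` unchanged if its URI scheme is allowed, else raise ValueError."""
    scheme = urlparse(path).scheme
    if scheme not in allowed:
        raise ValueError(
            "checkpoint dir %r has scheme %r, expected one of %r"
            % (path, scheme, allowed)
        )
    return path
```

It could then wrap the argument, as in sc.setCheckpointDir(require_scheme("hdfs://master:9000/RddCheckPoint")), where the host and port are the ones from Answer 1.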

Spark: Invalid checkpoint directory
