Spark job using too many resources

Time: 2016-01-11 07:45:33

Tags: scala apache-spark apache-spark-mllib

I am running a cross-validation study on a YARN cluster with 50 containers. The data is about 600,000 rows.

The job runs fine most of the time, but it uses a lot of RAM and CPU resources on the cluster's driver server (the machine the job is launched from): 3 to 4 CPU cores. However, I cannot use that many resources, because this server is shared by several people.

My questions are:

  1. Why does my code use so many resources on the driver?
  2. How can I modify it so that it consumes fewer resources on the driver and more on the cluster nodes?
  3. I don't know Spark very well, but my bet regarding the first question is that I should be using RDDs rather than Arrays and ParArrays, but I can't figure out how...

Here is my code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.configuration.{Algo, QuantileStrategy, Strategy}
    import org.apache.spark.mllib.tree.model.RandomForestModel
    import org.apache.spark.mllib.util.MLUtils
    import scala.collection.parallel.mutable.ParArray
    
    // setMaster / set belong to SparkConf, not SparkContext
    val conf = new SparkConf()
      .setMaster("yarn-client")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.amadeus.ssp.tools.SSPKryoRegistrator")
    val sc: SparkContext = new SparkContext(conf)
    
    val data = sc.textFile("...").map(some pre-treatment...)
    
    // Parameters
    val numModels = Array(5)
    val trainingRatioMajMin = 0.7
    // Tree Ensemble
    val numTrees = Array(50)
    val maxDepth = Array(30)
    val maxBins = Array(100)
    // RF
    val featureSubsetStrategy = Array("sqrt")
    val subsamplingRate = Array(1.0)
    
    // Ensemble model: the score of a point is the average of the individual
    // random-forest predictions (i.e. the fraction of forests voting 1.0)
    class Model(model: Array[RandomForestModel]) {
      def predict(data:RDD[Vector]) : RDD[Double] = {
        data.map(p => predict(p))
      }
      def predictWithLabels(data:RDD[LabeledPoint]) : RDD[(Double, Double)] = {
        data.map(p => (p.label, predict(p.features)))
      }
      def predict(point:Vector): Double = {
        model.map(m => m.predict(point)).sum / model.length
      }
    }
    
    // Grid of hyper-parameter combinations:
    // (numTrees, maxDepth, maxBins, featureSubsetStrategy, subsamplingRate, numModels)
    val CV_params: Array[Array[Any]] = {
      for (a <- numTrees; b <- maxDepth; c <- maxBins; d <- featureSubsetStrategy;
           e <- subsamplingRate; f <- numModels) yield Array(a, b, c, d, e, f)
    }
    
    // Sampling: draw params(5) = numModels bootstrap samples of the majority class
    def sampling(params: Array[Any], dataset: RDD[LabeledPoint], fraction: Double): (Array[RDD[LabeledPoint]], RDD[LabeledPoint]) = {
      logInfo("Begin Sampling")
      val dataset_maj = dataset filter (_.label == 0.0)
      val dataset_min = dataset filter (_.label == 1.0)
      dataset_maj.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)

      val data = ((1 to params(5).asInstanceOf[Int]).map { sample =>
        dataset_maj.sample(false, fraction)
      }.toArray, dataset_min)
      dataset_maj.unpersist()
      logInfo("End Sampling")
      data
    }
    
    // Train the ensemble: one random forest per majority-class sample
    // (fraction, numClasses and categoricalFeaturesInfo are defined elsewhere in the full class)
    def classificationModel(params: Array[Any], training: RDD[LabeledPoint]): Model = {
      val (samples, data_min) = sampling(params, training, fraction)
      data_min.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
      val models = samples.par.map { sample =>
        val strategy = new Strategy(Algo.Classification, org.apache.spark.mllib.tree.impurity.Gini, params(1).asInstanceOf[Int],
          numClasses, params(2).asInstanceOf[Int], QuantileStrategy.Sort, categoricalFeaturesInfo, 1, 0.0, 256, params(4).asInstanceOf[Double], false, 10)
        val model = RandomForest.trainClassifier(sample ++ data_min, strategy, params(0).asInstanceOf[Int], params(3).asInstanceOf[String], 0)
        logInfo(s"RF - totalNumNodes: ${model.totalNumNodes} - numTrees: ${model.numTrees}")
        model
      }.toArray
      data_min.unpersist()
      logInfo(s"RF: End RF training\n")
      new Model(models)
    }
    
    
     ///// Cross-validation
    val cv_data:Array[(RDD[LabeledPoint], RDD[LabeledPoint])] = MLUtils.kFold(data, numFolds, 0)
    
    logInfo("Begin cross-validation")
    val result : Array[(Double, Double)] = cv_data.par.map{case (training, validation) =>
      training.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
      validation.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
    
      val res :ParArray[(Double, Double)] = CV_params.par.zipWithIndex.map { case (p,i) =>
        // Training classifier
        val model = classificationModel(p, training)
        // Prediction
        val labelAndPreds = model.predictWithLabels(validation)
        // Metrics computation
        val bcm = new BinaryClassificationMetrics(labelAndPreds)
        logInfo("ROC: %s".format(bcm.roc().collect().map(_.toString).reduce(_ + " - " + _)))
        logInfo("PR: %s".format(bcm.pr().collect().map(_.toString).reduce(_ + " - " + _)))
        logInfo("auPR: %s".format(bcm.areaUnderPR().toString))
        logInfo("fMeasure: %s".format(bcm.fMeasureByThreshold().collect().map(_.toString).reduce(_ + " - " + _)))
        (bcm.areaUnderROC() / numFolds, bcm.areaUnderPR() / numFolds)
      }
    
      training.unpersist()
      validation.unpersist()
      res
    }.reduce((s1,s2) => s1.zip(s2).map(t => (t._1._1 + t._2._1, t._1._2 + t._2._2))).toArray
    
    val cv_roc = result.map(_._1)
    val cv_pr = result.map(_._2)
    
    logInfo("areaUnderROC: %s".format(cv_roc.map(_.toString).reduce( _ + " - " + _)))
    logInfo("areaUnderPR: %s".format(cv_pr.map(_.toString).reduce( _ + " - " + _)))
    
    // Extract best params (metric, best_values_array, CV_areaUnderROC and
    // CV_areaUnderPR are presumably fields defined elsewhere in the class)
    val which_max = (metric match {
      case "ROC" => cv_roc
      case "PR" => cv_pr
      case _ =>
        logWarning("Metrics set to default one: PR")
        cv_pr
    }).zipWithIndex.maxBy(_._1)._2
    
    best_values_array = CV_params(which_max)
    CV_areaUnderROC = cv_roc
    CV_areaUnderPR = cv_pr
    

EDIT

I launch it with:

    spark-submit \
        --properties-file spark.properties \
        --class theClass \
        --master yarn-client \
        --num-executors 50 \
        job.jar 
    

spark.properties:

    spark.rdd.compress               true
    spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
    spark.io.compression.codec       org.apache.spark.io.SnappyCompressionCodec
    
    spark.yarn.maxAppAttempts   1
    yarn.log-aggregation-enable true
    
    spark.executor.memory              5g
    spark.yarn.executor.memoryOverhead 1024
    
    spark.cassandra.connection.host            172.16.110.94
    spark.cassandra.connection.timeout_ms      60000
    spark.cassandra.connection.compression     SNAPPY
    

2 Answers:

Answer 0 (score: 0):

When you run the job, you can limit the cores and memory it can use on the driver.

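For example, with the standard spark-submit options (a minimal sketch; the values are illustrative, and note that --driver-cores only takes effect in cluster mode, not yarn-client):

    # --driver-memory caps the driver JVM heap;
    # --executor-memory / --executor-cores / --num-executors size the work done on the cluster nodes
    spark-submit \
        --master yarn-client \
        --driver-memory 2g \
        --executor-memory 5g \
        --executor-cores 4 \
        --num-executors 50 \
        --properties-file spark.properties \
        --class theClass \
        job.jar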

You can also change the number of executors available when running on YARN.

See the Spark Configuration Options documentation.

[edit]

I just noticed that I can't see anywhere in your code where you actually use Spark. I don't see a Spark context, or anywhere that you call parallelize on your collections (arrays). If you don't, they won't be distributed and processed in parallel. I'm quite new to Spark myself, but I don't see how your code uses Spark unless this is only a small part of it.
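For reference, distributing a local collection looks like this (a generic sketch, not taken from the question's code; names and sizes are illustrative):

    // A plain Array is processed in the driver JVM; parallelize() turns it into an
    // RDD whose partitions are processed by the executors.
    val localValues: Array[Int] = (1 to 1000000).toArray            // lives on the driver
    val distributed = sc.parallelize(localValues, numSlices = 50)   // spread over the cluster
    val total = distributed.map(_ * 2).reduce(_ + _)                // map/reduce run on executors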

Answer 1 (score: 0):

I will tell you why you are eating driver resources with the collect() API.

As per the Apache documentation:

collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

If you have N partitions (in your case 50 containers), the data from all partitions will be collected at the driver.

If you have a large data set, collect() may cause an OutOfMemory error at the driver program.
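A minimal sketch of the difference (the RDD and paths are illustrative, not from the question's code):

    val someLargeRdd = sc.textFile("hdfs:///some/input/path")
    
    // collect() ships every element to the driver: fine for small results, risky for a large RDD
    val allRows = someLargeRdd.collect()
    
    // Alternatives that keep the data on the executors:
    val preview  = someLargeRdd.take(20)                      // only a few elements reach the driver
    val rowCount = someLargeRdd.count()                       // only a single number reaches the driver
    someLargeRdd.saveAsTextFile("hdfs:///some/output/path")   // results are written from the cluster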

Have a look at some useful questions on how to handle this scenario:

Spark: Best practice for retrieving big data from RDD to local machine

Spark application fine tuning from cloudera blog