Converting a DStream[(Double, Double)] to an RDD[(Double, Double)]

Time: 2016-04-27 14:37:15

Tags: scala apache-spark

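I need to train a linear regression model on streaming data, which I read with textFileStream. The problem is that RegressionMetrics accepts an RDD[(Double, Double)], whereas output is a DStream[(Double, Double)]. How can I convert output to an RDD[(Double, Double)] so that I can use RegressionMetrics?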

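import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

// ssc is an existing StreamingContext
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(0.0, 0.0))
  .setStepSize(0.2)
  .setNumIterations(25)

val trainingData = ssc.textFileStream("/training/data/dir").map(LabeledPoint.parse)
val testData = ssc.textFileStream("/training/data/dir").map(LabeledPoint.parse)

model.trainOn(trainingData)

// output is a DStream[(Double, Double)] of (label, prediction) pairs
val output = model.predictOnValues(testData.map(lp => (lp.label, lp.features)))

// Does not compile: RegressionMetrics expects an RDD[(Double, Double)], not a DStream
val metrics = new RegressionMetrics(output)
val rmse = metrics.rootMeanSquaredError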

1 answer:

Answer 0 (score: 0)

A DStream is a sequence of underlying RDDs (one per batch of data), and each of them can be accessed with the foreachRDD method:

model.predictOnValues(testData.map(lp => (lp.label, lp.features))).foreachRDD { rdd =>
  val metrics = new RegressionMetrics(rdd)
  val rmse = metrics.rootMeanSquaredError
  // do something with `rmse` here
}
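
For context, here is a fuller sketch of how the per-batch evaluation might fit into a complete job. It is an illustration under stated assumptions, not the asker's actual program: the object name, the `local[2]` master, and the 10-second batch interval are placeholders that do not appear in the question. Two details are worth noting. First, `predictOnValues` keeps the key of each pair, so the stream carries (label, prediction), while `RegressionMetrics` documents its input as (prediction, observation). RMSE is symmetric in the two values, but metrics such as `r2` are not, so the sketch swaps each pair. Second, `textFileStream` yields an empty RDD for any interval in which no new files arrive, so the sketch skips empty batches.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingRmseSketch {
  def main(args: Array[String]): Unit = {
    // App name, master, and batch interval are placeholder assumptions.
    val conf = new SparkConf().setAppName("StreamingRmseSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Same directory for both streams, exactly as in the question.
    val trainingData = ssc.textFileStream("/training/data/dir").map(LabeledPoint.parse)
    val testData = ssc.textFileStream("/training/data/dir").map(LabeledPoint.parse)

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.dense(0.0, 0.0))
      .setStepSize(0.2)
      .setNumIterations(25)

    model.trainOn(trainingData)

    // DStream[(Double, Double)] of (label, prediction) pairs.
    val output = model.predictOnValues(testData.map(lp => (lp.label, lp.features)))

    output.foreachRDD { rdd =>
      // Skip intervals in which no test data arrived.
      if (!rdd.isEmpty()) {
        // Swap to (prediction, observation), the order RegressionMetrics documents.
        val metrics = new RegressionMetrics(rdd.map(_.swap))
        println(s"batch RMSE = ${metrics.rootMeanSquaredError}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each RMSE printed this way describes a single batch only. A running figure across batches would need state kept outside the `foreachRDD` closure, which this sketch does not attempt.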