LinearRegressionWithSGD() returns NaN

Date: 2015-07-21 10:04:58

Tags: machine-learning apache-spark

I am trying to use LinearRegressionWithSGD on the Million Song dataset, and my model returns NaNs as weights and 0.0 as the intercept. What might be causing the error? I am using Spark 1.4.0 in standalone mode.

示例数据:http://www.filedropper.com/part-00000

Here is my full code:

// Import dependencies
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

// Load the raw data
val data =
  sc.textFile("/home/naveen/Projects/millionSong/YearPredictionMSD.txt")

// Parse each line into a LabeledPoint (label = year, features = audio attributes)
def parsePoint(line: String): LabeledPoint = {
  val parts = line.split(",")
  val label = parts.head.toDouble
  val features = Vectors.dense(parts.tail.map(_.toDouble))
  LabeledPoint(label, features)
}

// Convert to labeled points and find the label range
val parsedDataInit = data.map(parsePoint)
val onlyLabels = parsedDataInit.map(_.label)
val minYear = onlyLabels.min()
val maxYear = onlyLabels.max()

// Shift the labels so the smallest is 0
val parsedData = parsedDataInit.map(x =>
  LabeledPoint(x.label - minYear, x.features))

// Split into training, validation, and test sets
val splits = parsedData.randomSplit(Array(0.8, 0.1, 0.1), seed = 123)
val parsedTrainData = splits(0).cache()
val parsedValData = splits(1).cache()
val parsedTestData = splits(2).cache()

val nTrain = parsedTrainData.count()
val nVal = parsedValData.count()
val nTest = parsedTestData.count()

// Error metrics
def squaredError(label: Double, prediction: Double): Double =
  math.pow(label - prediction, 2)

def calcRMSE(labelsAndPreds: RDD[List[Double]]): Double =
  math.sqrt(labelsAndPreds.map(x => squaredError(x(0), x(1))).mean())
// Train the model
val numIterations = 100
val stepSize = 1.0
val regParam = 0.01
val regType = "L2" // note: never applied; L2 would need optimizer.setUpdater(new SquaredL2Updater)

val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer
  .setNumIterations(numIterations)
  .setStepSize(stepSize)
  .setRegParam(regParam)
val model = algorithm.run(parsedTrainData)

// RMSE

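A minimal sketch of computing the validation RMSE with the `calcRMSE` helper defined above and MLlib's `model.predict`; variable names follow the question, and the printed message is illustrative:

```scala
// Pair each validation label with the model's prediction
val labelsAndPreds = parsedValData.map { point =>
  List(point.label, model.predict(point.features))
}
val valRMSE = calcRMSE(labelsAndPreds)
println(s"Validation RMSE: $valRMSE")
```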

2 Answers:

Answer 0 (score: 2)

I'm not familiar with this particular SGD implementation, but in general, when a gradient descent solver diverges to NaN it means the learning rate is too large. (In this case, I believe that is the stepSize variable.)

Try lowering it by an order of magnitude at a time until it starts to converge.
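A sketch of that search, assuming the `algorithm`, `parsedTrainData`, `parsedValData`, and `calcRMSE` definitions from the question; the candidate step sizes are illustrative:

```scala
// Lower the learning rate an order of magnitude at a time
for (step <- Seq(1.0, 0.1, 0.01, 0.001)) {
  algorithm.optimizer.setStepSize(step)
  val m = algorithm.run(parsedTrainData)
  val diverged = m.weights.toArray.exists(_.isNaN)
  val rmse = calcRMSE(parsedValData.map(p => List(p.label, m.predict(p.features))))
  println(s"stepSize=$step diverged=$diverged RMSE=$rmse")
}
```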

Answer 1 (score: 0)

I think there are two possibilities.

  1. stepSize matters a lot. You should try values like 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, ...
  2. Your training data contains NaNs. If so, the result may well be NaN.
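The second point can be checked directly. A sketch that drops any line with a non-finite or unparsable field before calling `parsePoint`, assuming the `data` RDD from the question (`isClean` is a hypothetical helper name):

```scala
// Keep only lines whose fields all parse to finite doubles
def isClean(line: String): Boolean =
  line.split(",").forall { s =>
    try {
      val d = s.toDouble
      !d.isNaN && !d.isInfinite
    } catch {
      case _: NumberFormatException => false
    }
  }

val cleanData = data.filter(isClean)
```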