Wrong intercept in Spark linear regression

Asked: 2016-05-05 10:40:41

Tags: apache-spark, regression, pyspark, linear-regression

I'm getting started with linear regression in Spark and am trying to fit a line to a linear dataset. It seems the intercept is not being adjusted correctly, or maybe I'm missing something.

With intercept=False:

linear_model = LinearRegressionWithSGD.train(labeledData, iterations=100, step=0.0001, intercept=False)

[Plot with intercept=False]

This looks fine. But when I use intercept=True:

linear_model = LinearRegressionWithSGD.train(labeledData, iterations=100, step=0.0001, intercept=True)

[Plot with intercept=True]

The model I get in this last case is exactly:

(weights=[0.0353471289751], intercept=1.0005127185289888)

I have tried different datasets, step sizes, and iteration counts, but the model always converges to an intercept of about 1.

EDIT - this is the code I am using:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
import numpy as np
import matplotlib.pyplot as plt
from pyspark import SparkContext
sc = SparkContext("local", "regression")

# Generate data
SIZE = 300
SLOPE = 0.1
BASE = -30
NOISE = 10

x = np.arange(SIZE)
delta = np.random.uniform(-NOISE,NOISE, size=(SIZE,))
y = BASE + SLOPE*x + delta
data = zip(range(len(y)), y) # zip with index
dataRDD = sc.parallelize(data)

# Normalize data
# mean = np.mean(data)
# std = np.std(data)
# dataRDD = dataRDD.map(lambda r: (r[0], (float(r[1])-mean)/std))

labeledData = dataRDD.map(lambda r: LabeledPoint(float(r[1]), [float(r[0])]))

# Create linear model
linear_model = LinearRegressionWithSGD.train(labeledData, iterations=1000, step=0.0002, intercept=True, convergenceTol=0.000001)
print linear_model

true_vs_predicted = labeledData.map(lambda p: (p.label, linear_model.predict(p.features))).collect()

# PLOT
fig = plt.figure()
ax = fig.add_subplot(111)
ax.grid()

y_real = [x[0] for x in true_vs_predicted] 
y_pred = [x[1] for x in true_vs_predicted] 

plt.plot(range(len(y_real)), y_real, 'o', markersize=5, c='b')
plt.plot(range(len(y_pred)), y_pred, 'o', markersize=5, c='r')

plt.show()

1 Answer:

Answer 0 (score: 1)

This happens because both the number of iterations and the step size are too small. As a result, the optimization ends before it can reach the local optimum.
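
As an illustration only (assuming the same labeledData RDD built in the question's code), giving SGD far more iterations and a tighter convergence tolerance lets the intercept move away from its starting point; the exact numbers below are not tuned:

from pyspark.mllib.regression import LinearRegressionWithSGD

# Sketch: more iterations so the intercept has time to converge; the step is
# kept small because the feature is unscaled (x ranges from 0 to 299).
linear_model = LinearRegressionWithSGD.train(
    labeledData,          # the LabeledPoint RDD from the code above
    iterations=100000,    # far more than the original 100/1000
    step=0.0001,
    intercept=True,
    convergenceTol=1e-9)
print(linear_model)       # the intercept should drift from ~1 toward BASE (-30)

Alternatively, scaling the feature (as in the commented-out normalization block in the question) makes a larger step usable and speeds up convergence considerably.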