Messy regression line on a scatter plot: Python

Time: 2016-02-18 09:21:27

Tags: python python-2.7 matplotlib scikit-learn

With Python 2.7.6, matplotlib, and scikit-learn 0.17.0, when I draw polynomial regression lines on a scatter plot, the polynomial curves come out very messy:


Here is the script. It reads two columns of float data, then draws the scatter plot and the regression lines:

import pandas as pd
import scipy.stats as stats
import pylab 
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
import sklearn
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

df=pd.read_csv("boston_real_estate_market_clean.csv")

LSTAT = df['LSTAT'].as_matrix()

LSTAT=LSTAT.reshape(LSTAT.shape[0], 1)

MEDV=df['MEDV'].as_matrix()

MEDV=MEDV.reshape(MEDV.shape[0], 1)

# Train/test split
X_train1, X_test1, y_train1, y_test1 = train_test_split(LSTAT, MEDV, test_size=0.3, random_state=1)

# Polynomial regression, n-th order

plt.scatter(X_test1, y_test1, s=10, alpha=0.3)

for degree in [1,2,3,4,5]:
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X_train1,y_train1)
    y_plot = model.predict(X_test1)
    plt.plot(X_test1, y_plot, label="degree %d" % degree
             +'; $q^2$: %.2f' % model.score(X_train1, y_train1)
             +'; $R^2$: %.2f' % model.score(X_test1, y_test1))


plt.legend(loc='upper right')

plt.show()

My guess is that the cause is that "X_test1, y_plot" are not sorted properly?

X_test1 is a numpy array like this:

[[  5.49]
 [ 16.65]
 [ 17.09]
 ....
 [ 25.68]
 [ 24.39]]

y_plot is a numpy array like this:

[[ 29.78517812]
 [ 17.16759833]
 [ 16.86462359]
 [ 23.18680265]
 ....
 [ 37.7631725 ]]

I tried sorting with this:

    [X_test1, y_plot] = zip(*sorted(zip(X_test1, y_plot), key=lambda y_plot: y_plot[0]))

    plt.plot(X_test1, y_plot, label="degree %d" % degree
             +'; $q^2$: %.2f' % model.score(X_train1, y_train1)
             +'; $R^2$: %.2f' % model.score(X_test1, y_test1))

Now the curves look normal, but the result is strange: R^2 is negative.


Can any expert tell me what the real problem is and how to sort this correctly? Thanks!

1 answer:

Answer 0: (score: 2)

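Although the plot is now correct, the sort broke the pairing between X_test1 and y_test1, because you forgot to sort y_test1 in the same way. model.score(X_test1, y_test1) then compares predictions for the sorted X values against targets that are still in the original order, which is why R^2 turns negative. The best solution is to sort right after the split. Then y_plot, which is computed later, will automatically come out in the right order (untested example, using numpy as np):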

X_train1, X_test1, y_train1, y_test1 = train_test_split(LSTAT, MEDV, test_size=0.3, random_state=1)

# X_test1 has shape (n, 1), so argsort its single column to get a 1-D index
sorted_index = np.argsort(X_test1[:, 0])
X_test1 = X_test1[sorted_index]
y_test1 = y_test1[sorted_index]
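
To make the pairing issue concrete, here is a minimal sketch on made-up data. It is not the original Boston dataset or the original pipeline: `np.polyfit` stands in for the `PolynomialFeatures` + `Ridge` model, and `r2_score` is a small hand-written helper (an assumption for illustration, not the scikit-learn function). It compares R^2 when X and y are sorted with the same index against R^2 when only X is sorted.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for the LSTAT / MEDV columns (made-up numbers).
rng = np.random.default_rng(0)
X = rng.uniform(2.0, 35.0, size=(200, 1))
y = 40.0 - 1.5 * X + 0.02 * X ** 2 + rng.normal(0.0, 1.0, size=(200, 1))

# Simple 70/30 split; the rows are already in random order.
X_train, X_test = X[:140], X[140:]
y_train, y_test = y[:140], y[140:]

# Degree-2 polynomial fit, standing in for the scikit-learn pipeline.
coeffs = np.polyfit(X_train[:, 0], y_train[:, 0], deg=2)

def predict(X_in):
    return np.polyval(coeffs, X_in)

# Score on the unsorted test set, as a reference.
r2_unsorted = r2_score(y_test, predict(X_test))

# Correct: one index, applied to both X_test and y_test.
order = np.argsort(X_test[:, 0])
X_sorted = X_test[order]
y_sorted = y_test[order]
y_plot = predict(X_sorted)  # already in plotting order
r2_sorted = r2_score(y_sorted, y_plot)

# Mistake from the question: X is sorted, y_test is left as it was.
r2_mismatched = r2_score(y_test, y_plot)

print("unsorted:   %.3f" % r2_unsorted)
print("sorted:     %.3f" % r2_sorted)
print("mismatched: %.3f" % r2_mismatched)
```

Sorting both arrays with the same index leaves the score unchanged, because R^2 does not depend on row order as long as each prediction is compared with its own target. Sorting only X breaks that pairing, so the score drops sharply. With the sorted arrays, `plt.plot(X_sorted, y_plot)` draws a single clean curve.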