How to train and predict with a random forest model?

Date: 2017-07-11 09:58:56

Tags: python pandas dataframe scikit-learn random-forest

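How can we train and predict with a random forest model? I want to train a model and finally predict a true value using a random forest model in Python on the three column dataset (click the link to download the full CSV; the dataset is formatted as shown below):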

t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10

I want to predict the current value of Y (the true value) from the last N data points of X (for example 5, 10, 100, 300, 1000, etc.), using a random forest model from sklearn in Python. That is, taking [0,0,1,2,3] from the X column as the input for the first window, I want to predict the Y value of the 5th row, trained on the previous values. With a simple rolling OLS regression model we can do this as follows, but I want to do it with a random forest model.

import pandas as pd

df = pd.read_csv('data_pred.csv')
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']], window_type='rolling', window=5, intercept=True)

I solved this problem with random forest, which yields df:

t_stamp    X   Y  X_t1  X_t2  X_t3  X_t4  X_t5
0.000543   0  10   NaN   NaN   NaN   NaN   NaN
0.000575   0  10   0.0   NaN   NaN   NaN   NaN
0.041324   1  10   0.0   0.0   NaN   NaN   NaN
0.041331   2  10   1.0   0.0   0.0   NaN   NaN
0.041336   3  10   2.0   1.0   0.0   0.0   NaN
0.041340   4  10   3.0   2.0   1.0   0.0   0.0
0.041345   5  10   4.0   3.0   2.0   1.0   0.0
0.041350   6  10   5.0   4.0   3.0   2.0   1.0
0.041354   7  10   6.0   5.0   4.0   3.0   2.0
.........................................................
[ 10.  10.  10.  10. .................................]
MSE: 1.3273548431

This seems to work for ranges of 5, 10, 15, 20 and 22. However, for ranges greater than 23 it does not seem to work properly (it prints MSE: 0.0). This is because, as you can see from the dataset, the value of Y is fixed (10) from row 1 to row 23 and then changes to another value (20, and so on) from row 24. How can we train a model and predict for such cases based on the last data points?
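For reference, a lag table like the one above can be rebuilt from the nine sample rows with `Series.shift`. This is only a minimal sketch: the column names `X_t1` to `X_t5` follow the table, while `n_lags` and `windowed` are illustrative names that do not appear in the original post.

```python
import io

import pandas as pd

# The nine sample rows shown in the question.
csv_text = """t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
"""
df = pd.read_csv(io.StringIO(csv_text))

n_lags = 5  # window length (illustrative choice)
for i in range(1, n_lags + 1):
    # X_t{i} holds the value of X from i rows earlier; NaN where no history exists.
    df[f"X_t{i}"] = df["X"].shift(i)

# Rows without a full window contain NaN. Dropping them on the combined
# frame removes the same rows from the features and from Y.
windowed = df.dropna()

print(df)
print(windowed)
```

With a window of 5, the first five rows are incomplete, so only the last four sample rows remain in `windowed`.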

1 Answer:

Answer 0 (score: 1)

With your current code, it seems that calling dropna truncates X but not y. You also train and test on the same data.

Fixing this leads to a non-zero MSE.
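The asker's random forest code is not shown here, so the following is only an illustration of the kind of misalignment described, on a small synthetic frame (the data and the names `lags`, `X_bad`, `X_ok` are assumptions made for the example). Dropping NaN rows from the lagged features alone shortens X but leaves y at full length. Attaching the target before calling `dropna` keeps the two in step.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, not the asker's CSV.
df = pd.DataFrame({"X": np.arange(10), "Y": np.arange(10) * 10})

# Three lag features of X; the first three rows have incomplete history.
lags = pd.DataFrame({f"X_{i}": df["X"].shift(i) for i in range(1, 4)})

# Misaligned: the features lose their first 3 rows, the target keeps all 10.
X_bad = lags.dropna().values
y_bad = df["Y"].values
print(X_bad.shape, y_bad.shape)  # (7, 3) (10,)

# Aligned: attach the target first, then drop incomplete rows once.
lags["Y"] = df["Y"]
clean = lags.dropna()
X_ok = clean.iloc[:, :-1].values
y_ok = clean.iloc[:, -1].values
print(X_ok.shape, y_ok.shape)  # (7, 3) (7,)
```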

Code:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv('/Users/shivadeviah/Desktop/estimated_pred.csv')

# Build 25 lag features: X_0 is the current X, X_1 the previous row, and so on.
df1 = pd.DataFrame({'X_%d' % i: df['X'].shift(i) for i in range(25)})
df1['Y'] = df['Y']

# Shuffle the rows, then drop incomplete windows from features and target together.
df1 = df1.sample(frac=1).reset_index(drop=True)
df1.dropna(inplace=True)

X = df1.iloc[:, :-1].values
y = df1.iloc[:, -1].values

# Train on the first 66% of the shuffled rows and test on the remaining 34%.
x = int(len(X) * 0.66)

X_train = X[:x]
X_test = X[x:]
y_train = y[:x]
y_test = y[x:]

# The default criterion is squared error. The original passed criterion='mse',
# a name that newer scikit-learn releases no longer accept.
reg = RandomForestRegressor()
reg.fit(X_train, y_train)

modelPred = reg.predict(X_test)

print(modelPred)
print("Number of predictions:", len(modelPred))

meanSquaredError = mean_squared_error(y_test, modelPred)

print("MSE:", meanSquaredError)
print(df1.size)

# Attach the predictions to the held-out rows for inspection.
df2 = df1.iloc[x:, :].copy()
df2['pred'] = modelPred

df2.head()

Output:

[ 267.7     258.26608241  265.07037249 ...,  267.27370169  256.7     272.2 ]
Number of predictions: 87891
MSE: 1954.9271256
6721026

        X_0       pred
170625  48  267.700000
170626  66  258.266082
170627  184 265.070372
170628  259 294.700000
170629  271 281.966667
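The manual shuffle and 66/34 slice in the code above can also be written with `train_test_split`. The sketch below is one way to do that on synthetic data, not the answer's actual run. The step-shaped `Y` (constant for 23 rows, then jumping by 10), the 5-lag window, `n_estimators=50` and the fixed `random_state` are all assumptions chosen so the example is small and repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV: X counts up, Y is constant for 23 rows
# and then steps to the next multiple of 10 (10, 20, 30, ...).
n = 500
idx = np.arange(n)
df = pd.DataFrame({"X": idx, "Y": 10 * (idx // 23 + 1)})

# Lag features X_0 (current X) to X_4, with the target attached before dropna
# so features and target stay aligned.
n_lags = 5
feats = pd.DataFrame({f"X_{i}": df["X"].shift(i) for i in range(n_lags)})
feats["Y"] = df["Y"]
feats = feats.dropna()

X = feats.drop(columns="Y").values
y = feats["Y"].values

# Shuffled hold-out split: about 66% of the rows for training, 34% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, shuffle=True, random_state=0
)

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_train, y_train)

pred = reg.predict(X_test)
mse = mean_squared_error(y_test, pred)
print("Number of predictions:", len(pred))
print("MSE:", mse)
```

Because the model is scored on rows it did not see during fitting, the reported MSE reflects held-out error rather than training error.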