Question

I am using LinearRegression() from sklearn to make predictions. I have created a number of features for X and am trying to understand how I can select the best ones automatically. Say I have defined 50 different features for X and a single output y. Is there a way to select the best-performing features automatically instead of doing it manually?

I can also get the RMSE using the following command:

scores = np.sqrt(-cross_val_score(lm, X, y, cv=20, scoring='neg_mean_squared_error')).mean()

From here, how can I use this RMSE score? Do I have to make multiple predictions? There must be a way to use predict() with some optimisations, but I couldn't find out how.

Answer 1

sklearn doesn't seem to have a stepwise regression algorithm, which would help in understanding the importance of features. However, it does provide recursive feature elimination (RFE), a greedy feature elimination algorithm similar to sequential backward selection.

See the documentation here:

Recursive Feature Elimination
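
As a rough illustration of recursive feature elimination, here is a minimal sketch using RFECV, the cross-validated variant in sklearn.feature_selection. The data is synthetic (make_regression stands in for the asker's 50 features), and the helper cv_rmse and the choice of cv=5 are assumptions made for the example, not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the asker's data: 50 features, only 10 informative.
X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

lm = LinearRegression()

# RFECV repeatedly refits the model, drops the feature with the smallest
# coefficient magnitude, and uses cross-validation to pick how many to keep.
selector = RFECV(lm, step=1, cv=5, scoring="neg_mean_squared_error")
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)  # column indices that were kept
X_selected = selector.transform(X)


def cv_rmse(model, features, target):
    """Mean cross-validated RMSE, computed the same way as in the question."""
    mse = -cross_val_score(model, features, target, cv=5,
                           scoring="neg_mean_squared_error")
    return np.sqrt(mse).mean()


rmse_all = cv_rmse(lm, X, y)
rmse_selected = cv_rmse(lm, X_selected, y)
print(f"kept {selector.n_features_} of {X.shape[1]} features")
print(f"RMSE with all features: {rmse_all:.2f}, with selected: {rmse_selected:.2f}")
```

Because the selector here was fitted on all of the data, the second RMSE is somewhat optimistic. Putting the selector and the regressor in a Pipeline and cross-validating the pipeline would likely give a fairer estimate. Ranking by coefficient size also assumes the features are on comparable scales.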

Note that this will not necessarily reduce your RMSE. You might also try other techniques such as Ridge and Lasso regression.
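
To compare these alternatives, one option is to score each model with the same cross-validated RMSE used in the question. The sketch below is only an illustration: it assumes synthetic data, a StandardScaler in front of the regularised models, and an arbitrary alpha grid for RidgeCV.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's data: 50 features, only 10 informative.
X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

# Regularised models are scale-sensitive, so they get a scaler in front.
# RidgeCV and LassoCV tune the penalty strength alpha internally.
models = {
    "linear": LinearRegression(),
    "ridge": make_pipeline(StandardScaler(),
                           RidgeCV(alphas=np.logspace(-3, 3, 13))),
    "lasso": make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0)),
}

rmse = {}
for name, model in models.items():
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error")
    rmse[name] = np.sqrt(mse).mean()
    print(f"{name}: RMSE = {rmse[name]:.2f}")

# Lasso drives some coefficients exactly to zero, which acts as a
# built-in form of feature selection.
lasso = models["lasso"].fit(X, y)
n_kept = int(np.sum(lasso.named_steps["lassocv"].coef_ != 0))
print(f"lasso kept {n_kept} of {X.shape[1]} features")
```

Whichever model gives the lowest cross-validated RMSE is then refitted on the full training data and used for predict().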

Answer 2

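RMSE measures the average magnitude of the prediction error.

RMSE gives high weight to large errors, so lower values are always better. You can only improve RMSE by having an appropriate model. For feature selection, you can use PCA, stepwise regression, or basic correlation techniques. If you see a lot of multicollinearity, go for Lasso or Ridge regression. Also make sure you have a decent split between test and training data; if your test data is poor, you will get poor results. In addition, check the R-squared on the training data and on the test data to make sure the model is not overfitting. It would help if you added information on the number of observations in your test and training data, along with the R-squared values. Hope this helps.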
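
The train/test R-squared check described above could look roughly like the sketch below. The synthetic data and the 75/25 split are assumptions for the example. A training R-squared that is much higher than the test R-squared is a common sign of overfitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data: 50 features, only 10 informative.
X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

lm = LinearRegression().fit(X_train, y_train)

# score() returns R-squared for regressors.
r2_train = lm.score(X_train, y_train)
r2_test = lm.score(X_test, y_test)

# RMSE on the held-out rows, in the same units as y.
rmse_test = np.sqrt(mean_squared_error(y_test, lm.predict(X_test)))

print(f"train rows: {X_train.shape[0]}, test rows: {X_test.shape[0]}")
print(f"R^2 train: {r2_train:.3f}, R^2 test: {r2_test:.3f}")
print(f"test RMSE: {rmse_test:.2f}")
```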

What is the best way to minimize the RMSE?
