Random forest accuracy is too low

Time: 2019-03-31 15:56:31

Tags: python machine-learning scikit-learn random-forest

I want to predict electricity consumption using a random forest. After preparing the data, its current state is as follows:

X=df[['Temp(⁰C)','Araç Sayısı (adet)','Montaj V362_WH','Montaj V363_WH','Montaj_Temp','avg_humidity']]

X.head(15)

Output:

     Temp(⁰C)  Araç Sayısı (adet)  Montaj V362_WH  Montaj V363_WH  Montaj_Temp  avg_humidity
0    3.250000                 0.0             0.0             0.0    17.500000     88.250000
1    3.500000               868.0            16.0            18.0    20.466667     82.316667
2    3.958333               774.0            18.0            18.0    21.166667     87.533333
3    6.541667                 0.0             0.0             0.0    18.900000     83.916667
4    4.666667               785.0            16.0            18.0    20.416667     72.650000
5    2.458333               813.0            18.0            18.0    21.166667     73.983333
6   -0.458333               804.0            16.0            18.0    20.500000     72.150000
7   -1.041667               850.0            16.0            16.0    19.850000     76.433333
8   -0.375000               763.0            16.0            18.0    20.500000     76.583333
9    4.375000              1149.0            16.0            16.0    21.416667     84.300000
10   8.541667                 0.0             0.0             0.0    21.916667     71.650000
11   6.625000               763.0            16.0            18.0    22.833333     73.733333
12   5.333333               783.0            16.0            16.0    22.166667     69.250000
13   4.708333               764.0            16.0            18.0    21.583333     66.800000
14   4.208333               813.0            16.0            16.0    20.750000     68.150000

y.head(15)

Output:

    Montaj_ET_kWh/day
0   11951.0
1   41821.0
2   42534.0
3   14537.0
4   41305.0
5   42295.0
6   44923.0
7   44279.0
8   45752.0
9   44432.0
10  25786.0
11  42203.0
12  40676.0
13  39980.0
14  39404.0

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=None)

clf = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)
clf.fit(X_train, y_train['Montaj_ET_kWh/day'])
for feature in zip(feature_list, clf.feature_importances_):
    print(feature)

Output:

  ('Temp(⁰C)', 0.11598075020423881)
  ('Araç Sayısı (adet)', 0.7047301384616493)
  ('Montaj V362_WH', 0.04065706901940535)
  ('Montaj V363_WH', 0.023077554218712878)
  ('Montaj_Temp', 0.08082006262985514)
  ('avg_humidity', 0.03473442546613837)


from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(clf, threshold=0.10)
sfm.fit(X_train, y_train['Montaj_ET_kWh/day'])

for feature_list_index in sfm.get_support(indices=True):
    print(feature_list[feature_list_index])

Output:

  Temp(⁰C)
  Araç Sayısı (adet)

from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)

clf_important = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)
clf_important.fit(X_important_train, y_train)
y_test = y_test.values
y_pred = clf.predict(X_test)
y_test = y_test.reshape(-1, 1)
y_pred = y_pred.reshape(-1, 1)
y_test = y_test.ravel()
y_pred = y_pred.ravel()
label_encoder = LabelEncoder()
y_pred = label_encoder.fit_transform(y_pred)
y_test = label_encoder.fit_transform(y_test)

accuracy_score(y_test, y_pred)

Output:

 0.010964912280701754

I don't know why the accuracy is so low or where I went wrong.

1 answer:

Answer 0 (score: 2)

Your mistake is that you are asking for accuracy (a classification metric) in a regression setting, where it is meaningless.
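To make this concrete, here is a minimal sketch of what happens when continuous targets and predictions are each passed through their own LabelEncoder before accuracy_score, as in the question. The numbers are made up (chosen in the same range as the kWh/day values) and the variable names are illustrative. Each encoder replaces a value with its rank inside its own array, so the resulting "accuracy" compares rank positions and says nothing about how close the predictions are.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Made-up daily consumption values (kWh/day), for illustration only.
y_true = np.array([41821.0, 42534.0, 41305.0, 42295.0])

# Predictions that are all within 1% of the true values.
close_pred = np.array([41900.0, 42200.0, 41400.0, 42250.0])

# Predictions that are ten times too large.
far_pred = y_true * 10


def rank_accuracy(a, b):
    """Encode each array with its own LabelEncoder, then compare the codes."""
    codes_a = LabelEncoder().fit_transform(a)
    codes_b = LabelEncoder().fit_transform(b)
    return accuracy_score(codes_a, codes_b)


# Fraction of predictions that are exactly equal to the target.
exact_match = float(np.mean(y_true == close_pred))

# Largest relative error of the close predictions.
max_rel_error = float(np.max(np.abs(y_true - close_pred) / y_true))

acc_close = rank_accuracy(y_true, close_pred)
acc_far = rank_accuracy(y_true, far_pred)

print("exact matches:", exact_match)
print("max relative error of close predictions:", max_rel_error)
print("rank 'accuracy', close predictions:", acc_close)
print("rank 'accuracy', 10x predictions:", acc_far)
```

In this toy case the close predictions get a lower score than the ones that are off by a factor of ten, because only the ordering within each array matters.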

From the accuracy_score documentation (emphasis added):

sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

Accuracy classification score.
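As a small illustration, the sketch below calls accuracy_score directly on continuous predictions, without any label encoding. The values are made up and are not the asker's data. scikit-learn rejects the call with a ValueError because the inputs are not class labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up regression targets and predictions, for illustration only.
y_true = np.array([11951.0, 41821.0, 42534.0])
y_pred = np.array([12010.3, 41790.8, 42120.1])

try:
    accuracy_score(y_true, y_pred)
    raised = False
    message = ""
except ValueError as err:
    # accuracy_score only accepts classification targets.
    raised = True
    message = str(err)

print("ValueError raised:", raised)
print(message)
```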

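Check the list of metrics available in scikit-learn for suitable regression metrics (there you can also confirm that accuracy is used only for classification). For more details, see the answer in Accuracy Score ValueError: Can't Handle mix of binary and continuous target.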
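A minimal sketch of evaluating a RandomForestRegressor with regression metrics instead. It assumes synthetic data: the two features, the coefficients and the noise level are invented for illustration and only loosely imitate the temperature and vehicle-count columns in the question. r2_score, mean_absolute_error and mean_squared_error all come from sklearn.metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the asker's dataset).
rng = np.random.default_rng(0)
n_samples = 300
temperature = rng.normal(5.0, 4.0, size=n_samples)
vehicles = rng.integers(0, 1200, size=n_samples).astype(float)
X = np.column_stack([temperature, vehicles])
y = (12000.0 + 35.0 * vehicles - 150.0 * temperature
     + rng.normal(0.0, 1500.0, size=n_samples))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Regression metrics: R^2 is unitless, MAE and RMSE are in the target's units.
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))

print(f"R^2:  {r2:.3f}")
print(f"MAE:  {mae:.1f} kWh/day")
print(f"RMSE: {rmse:.1f} kWh/day")
```

For a regressor, reg.score(X_test, y_test) returns the same R^2 value, so it can serve as a quick check before computing the other metrics.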