Random Forest error (inconsistent numbers of samples in input variables)

Time: 2018-12-29 13:23:26

Tags: python random-forest feature-selection

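I have read many examples of the "inconsistent numbers of samples" error, but I still cannot see what is wrong with my code.

I have one Excel file. Sheet 1 contains the data. Sheet 2 contains a list of variable names.

I save the variables from Sheet 2 into an array and feed it into a random forest model to evaluate their influence on a parameter in Sheet 1.

But I get "Found input variables with inconsistent numbers of samples: [54, 2016]".

54 is the number of variables in Sheet 2. 2016 is the number of data rows in Sheet 1.

I am trying to see how these 54 variables affect the "target" variable in Sheet 1.

How should I process my data to make this work?

Thank you very much.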

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=1)

df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)

print(len(df2.columns))

allvar = list()

for each_var in df2.columns:
    allvar.append(each_var)

allvar = np.array(allvar)
print(allvar)

target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)

allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.6)

clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)

clf.fit(allvar_train, target_train)

for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)

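Sheet 1 (saved as df) looks like this (screenshot: Sheet 1)

Sheet 2 (saved as df2) looks like this (screenshot: Sheet 2)

The error log looks like this (screenshot: Error log)

Error log 2: Unknown label type: 'continuous' (screenshot: Error Log 2)

(screenshot: allvar_train)

(screenshot: target train)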

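2 Answers:

Answer 0 (score: 1)

The problem is in 'train_test_split': you are passing only the feature column names, not the data. Use the list of column names to select the data from the DataFrame, like this: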

allvar_train,allvar_test,target_train,target_test= train_test_split(df[allvar],target, random_state=0, test_size=0.6)

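You don't necessarily need to convert 'allvar' and 'target' to numpy arrays; the DataFrame selection and the target column can be passed directly to 'train_test_split'.

Note: this problem has nothing to do with random forests.

Answer 1 (score: -1)

This is the code that worked for me.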
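
To see why the two sample counts disagree, here is a minimal sketch that uses synthetic data in place of the Excel sheets. The column names, the row count, and the "target" column are made up for illustration; only the 'train_test_split' call and the DataFrame selection mirror the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for sheet 1: 20 rows, 3 feature columns plus a target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, 4)), columns=["a", "b", "c", "target"])

# Hypothetical stand-in for the variable names read from sheet 2.
allvar = ["a", "b", "c"]
target = df["target"]

# Passing the names themselves: 3 "samples" of names versus 20 target rows.
try:
    train_test_split(np.array(allvar), target, random_state=0, test_size=0.6)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print(mismatch_error)

# Passing the data selected by those names: both inputs now have 20 rows.
X_train, X_test, y_train, y_test = train_test_split(
    df[allvar], target, random_state=0, test_size=0.6
)
print(X_train.shape, X_test.shape)
```

The first call fails because the array of names has one entry per variable, while the target has one entry per data row. Selecting 'df[allvar]' gives one row per data row, so the lengths match.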

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.metrics import accuracy_score

    df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=0)
    df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=1)

    df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
    df.set_index(df['DateTime'], inplace=True)

    print(len(df2.columns))

    allvarlist = list()

    for each_var in df2.columns:
        allvarlist.append(each_var)

    countvar = len(allvarlist)

    allvar = df[allvarlist]
    allvar = allvar.values.reshape(len(allvar),countvar)

    target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
    target=target.values.reshape(len(target),1)

    allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.7)

    clf = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)

    #print(allvar_train)
    #print(target_train)

    clf.fit(allvar_train,np.ravel(target_train))

    for feature in zip(allvarlist, clf.feature_importances_):
        print(feature)

    importances = clf.feature_importances_
    #indices = np.argsort(importances)

    # Horizontal bar chart with one bar per variable
    plt.figure().set_size_inches(14,16)
    plt.barh(range(allvar_train.shape[1]), importances, color="r")
    plt.yticks(range(allvar_train.shape[1]),allvarlist)
    plt.show()
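
The second error in the question (Unknown label type: 'continuous') is what a classifier reports when the target is a continuous number, which is presumably why the code above switches to RandomForestRegressor. The following is a minimal sketch on synthetic data, not the original dataset: the column names and the target formula are made up, and n_estimators is kept small so it runs quickly. It also ranks the importances, which is one possible use of the commented-out argsort line.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Hypothetical data: 200 rows, 3 features, target driven mostly by column "a".
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 3)), columns=["a", "b", "c"])
y = 3.0 * X["a"] + 0.1 * rng.random(200)

# A classifier expects discrete class labels and rejects a continuous target.
try:
    RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
    classifier_error = None
except ValueError as exc:
    classifier_error = str(exc)
print(classifier_error)

# A regressor accepts the same continuous target.
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y)

# Pair each importance with its column name and sort, largest first.
ranked = pd.Series(reg.feature_importances_, index=X.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

Keeping the importances in a pandas Series indexed by column name avoids tracking a separate list of labels when sorting or plotting.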