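Different number of features in training and test sets - Random Forest sklearn Python

Time: 2017-06-19 13:27:54

Tags: python scikit-learn random-forest

I am using the sklearn package in Python to fit a random forest regression model to data that looks like this: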

data_train = ['.3 0:.5 1:.2 3:.7 6:.9 7:.1','.2 1:.5 2:.7 4:-0.3 5:1 6:0.7','.1 0:.3 1:.3 2:.2 3:.1 4:-0.2 5:0.3 6:0.7','.5 0:.3 1:.3 2:.5 3:.6 4:-0.1 5:0.4 6:0.6','.4 1:.3 2:.2 3:.2 4:-0.6 5:0.7 6:0.8','.6 0:.8 1:.3 2:.4 3:.4 4:-0.2 5:0.3 6:0.10','.9 0:.3 1:.3 2:.2 3:-.4 4:-0.2 5:-0.3','.9 0:.3 1:.1 2:.1 3:-.4 4:-0.1 5:-0.3','.1 0:.3 1:.3 2:.2 3:-.5 4:-0.2 5:-0.5']
data_test = ['.2 0:.4 1:.65 3:.8 6:.1','.2 1:.3 2:.6 4:-0.2 5:.6 6:0.6','.5 0:.3 1:.3 2:.2 3:.1 4:-0.2 5:0.3 6:0.7','.5 0:.3 1:.3 2:.5 3:.6 4:-0.1 5:0.4 6:0.6','.4 1:.3 2:.2 3:.2 4:-0.6 5:0.7 6:0.8','.6 0:.8 1:.3 2:.4 3:.4 4:-0.2 5:0.3 6:0.10','.9 0:.3 1:.3 2:.2 3:-.4 4:-0.2 5:-0.3','.9 0:.3 1:.1 2:.1 3:-.4 4:-0.1 5:-0.3','.1 0:.3 1:.3 2:.2 3:-.5 4:-0.2 5:-0.5']

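In each row, the first value is the output variable and the remaining values are feature:value pairs.

I use the following code to create a sparse matrix from the data: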

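from scipy.sparse import csc_matrix

def sparse_mat(data):
    # Parse rows of the form 'y col:value col:value ...' into a sparse matrix X and a target list y.
    row1 = []
    col1 = []
    data1 = []
    y = []
    for rownum, x in enumerate(data):
        elems = x.strip().split(' ')
        # The first token is the output variable.
        y.append(float(elems[0]))
        # The remaining tokens are feature:value pairs.
        for elem in elems[1:]:
            colnum, value = elem.split(':')
            row1.append(rownum)
            col1.append(int(colnum))
            data1.append(float(value))
    # No shape is given, so the column count is inferred from the largest column index seen.
    X = csc_matrix((data1, (row1, col1)))
    return X, y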

X_train,y_train = sparse_mat(data_train)
X_test,y_test = sparse_mat(data_test)

I then fit the random forest regression model with the following code:

from scipy.sparse import csc_matrix,csr_matrix
from sklearn.ensemble import RandomForestRegressor

rf=RandomForestRegressor(n_estimators=50,max_features='sqrt')
rf=rf.fit(X_train,y_train)

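However, when I try to use the model fitted on the training set to predict the output variable for the test set with the following code: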

predictions=rf.predict(X_test)

I get the following error:

ValueError: Number of features of the model must match the input. Model n_features is 8 and input n_features is 7 

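As far as I understand, the number of features in the training set must match the number of features in the test set. In the real world, however, when I train a model to predict an outcome variable, I may not know which features will be available in the out-of-sample test set. Is there a way to train a random forest model with N features, then supply a test set with N-k features and still get predictions?

1 Answer:

Answer 0: (score: 1)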

I ran into this same problem at work last week. We handled it by creating the missing feature in the test dataset and filling it with values imputed from the training data.
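One way this padding could look for the sparse matrices in the question is sketched below. It is not the code we used: the function name `pad_test_features` is hypothetical, the training column mean is just one possible imputation, and the sketch assumes the missing features are the trailing columns (which is what happens when `csc_matrix` infers its width from the largest column index).

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def pad_test_features(X_train, X_test):
    """Append the trailing columns X_test lacks, filled with training column means.

    Hypothetical helper; assumes the missing features are the highest-numbered columns.
    """
    n_train = X_train.shape[1]
    n_test = X_test.shape[1]
    if n_test > n_train:
        raise ValueError("test set has features the model never saw")
    if n_test == n_train:
        return csr_matrix(X_test)
    # Mean of each training column that is absent from the test matrix.
    fill = np.asarray(X_train.mean(axis=0)).ravel()[n_test:]
    # Repeat those means once per test row, then glue them onto the right side.
    extra = np.tile(fill, (X_test.shape[0], 1))
    return hstack([csr_matrix(X_test), csr_matrix(extra)], format="csr")

# Tiny illustration: training data has 3 columns, test data only 2.
X_train_demo = csr_matrix(np.array([[1.0, 0.0, 2.0], [3.0, 0.0, 4.0]]))
X_test_demo = csr_matrix(np.array([[5.0, 6.0]]))
X_test_padded = pad_test_features(X_train_demo, X_test_demo)
```

With the padded matrix, `rf.predict` receives the same number of columns the model was fitted on. If a missing entry should simply mean zero, which is how a sparse matrix treats it anyway, passing an explicit `shape=(n_rows, n_features)` to `csc_matrix` when building the test matrix would likely be enough.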

You can run into the same issue once you start dummifying categorical variables. There, our approach was to group the rare values (those with low counts) into a single bucket. If you're pulling data from a database, you'll want to implement this in SQL, since you want to keep as much of the data processing as possible out of Python, so get ready to use CASE WHEN statements.
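For anyone who does need the bucketing on the Python side, here is a minimal sketch of the idea. The helper names, the `min_count` threshold, and the `"OTHER"` label are all made up for illustration; the point is that the dummy columns are fixed from the training data, so the test set always produces the same number of features.

```python
from collections import Counter

def fit_category_buckets(train_values, min_count=2, other="OTHER"):
    """Return the dummy columns: categories seen at least min_count times, plus a catch-all."""
    counts = Counter(train_values)
    kept = sorted(v for v, c in counts.items() if c >= min_count)
    return kept + [other]

def one_hot(values, columns, other="OTHER"):
    """One-hot encode values against fixed columns; rare or unseen values go to the bucket."""
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for v in values:
        row = [0] * len(columns)
        row[index.get(v, index[other])] = 1
        rows.append(row)
    return rows

train_colors = ["red", "red", "blue", "blue", "green"]
test_colors = ["blue", "purple"]  # "purple" never appears in training

columns = fit_category_buckets(train_colors)
train_rows = one_hot(train_colors, columns)
test_rows = one_hot(test_colors, columns)
```

Here "green" is too rare to get its own column and "purple" was never seen, so both land in the catch-all column, and the training and test rows end up the same width.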

There's no single "correct" answer to this problem. It all depends on the context of your problem and your data; these are just the methods I used for the same problem you described.