Question

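I would like to know: when I do a train/test split (20% test, 80% train) and then apply 5-fold cross-validation, does that mean every data point ends up in a test fold exactly once? Or are the folds chosen randomly each time, so that the same sample may be included in a test fold several times while other samples are never included in one?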

from sklearn.model_selection import cross_val_score, train_test_split

# 20% of the data is held out as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)

Answer 1

Will all of the data be in a test set once? Yes, at least for the data that is passed to the cross-validation method. For example:

import numpy as np
from sklearn.model_selection import KFold, train_test_split
X = np.arange(10)
y = np.concatenate((np.ones(5), np.zeros(5)))
X
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y
array([1., 1., 1., 1., 1., 0., 0., 0., 0., 0.])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
X_train
array([5, 0, 7, 2, 9, 4, 3, 6])
X_test
array([8, 1])
kf = KFold(n_splits=5)
for train, test in kf.split(X_train):
    print(train, test)
[2 3 4 5 6 7] [0 1]
[0 1 4 5 6 7] [2 3]
[0 1 2 3 6 7] [4 5]
[0 1 2 3 4 5 7] [6]
[0 1 2 3 4 5 6] [7]

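You can see that the test indices run from 0 to 7. These are positions within X_train, not the values themselves, so each of the 8 values in X_train appears in a cross-validation test fold exactly once. The two samples held out in X_test are never seen by the cross-validation. This pattern holds regardless of sample size.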

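The size of the splits created by the cross-validation split method is determined by the ratio of your data to the number of splits chosen. For example, if I had set KFold(n_splits=8) (the same size as my X_train array), each split's test set would contain a single data point.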
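As a rough check of how the fold sizes follow from the sample count and n_splits, the sketch below lists the test-fold sizes for a few combinations. The helper name fold_sizes is an assumption made for this example and is not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold

def fold_sizes(n_samples, n_splits):
    """Return the test-fold sizes KFold produces for n_samples data points.

    Hypothetical helper written for this illustration only.
    """
    cv = KFold(n_splits=n_splits)
    return [len(test_idx) for _, test_idx in cv.split(np.arange(n_samples))]

sizes_8_5 = fold_sizes(8, 5)    # the case from the example above
sizes_8_8 = fold_sizes(8, 8)    # one data point per test fold
sizes_10_5 = fold_sizes(10, 5)  # sample count divides evenly

print(sizes_8_5)   # [2, 2, 2, 1, 1]
print(sizes_8_8)   # [1, 1, 1, 1, 1, 1, 1, 1]
print(sizes_10_5)  # [2, 2, 2, 2, 2]
```

When the sample count does not divide evenly, the first n_samples % n_splits folds get one extra sample, which is why the 8-sample, 5-fold case above produces three folds of size 2 followed by two folds of size 1.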
