How much higher than the test accuracy can the training accuracy be before considering the model overfit?

Asked: 2018-05-02 21:25:11

Tags: python machine-learning scikit-learn adaboost ensemble-learning

Consider a dataset of 920 samples with 40 features in a binary classification problem. The dataset is the publicly available Heart Disease dataset: archive.ics.uci.edu/ml/datasets/heart+Disease

I preprocessed the dataset, discarding the features with > 50% missing data. For those with < 50% missing data I imputed the values using MICE (ncbi.nlm.nih.gov/pmc/articles/PMC3074241). I then binned the numerical data into categories by grouping values into equal-width ranges, and one-hot encoded (OHE) the resulting features; a sketch of this preprocessing follows the dataset summary below. Information about the dataset:

[image: dataset summary]
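For reference, a minimal sketch of that preprocessing, assuming a pandas DataFrame df holding the raw features plus a binary target column (df, the column name, and n_bins are hypothetical; IterativeImputer is scikit-learn's chained-equations imputer, used here as a stand-in for MICE):

    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.preprocessing import KBinsDiscretizer

    # df: hypothetical DataFrame with the 40 raw features + binary 'target'.
    y = df["target"]
    X = df.drop(columns="target")

    # 1) Drop features with more than 50% missing values.
    X = X.loc[:, X.isna().mean() <= 0.5]

    # 2) Impute the remaining missing values (MICE-like chained equations).
    X_imp = IterativeImputer(random_state=0).fit_transform(X)

    # 3) Bin numeric values into equal-width ranges and one-hot encode the
    #    bins in one step (n_bins=4 is illustrative).
    X_ohe = KBinsDiscretizer(
        n_bins=4, encode="onehot-dense", strategy="uniform"
    ).fit_transform(X_imp)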

The models considered are: linear SVM, RBF SVM, logistic regression, KNN, decision tree, random forest.

I tuned the parameters using RandomizedSearchCV (a sketch of the tuning call follows the list of fitted models):

> Model: SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

> Model: SVC(C=10000.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

> Model: LogisticRegression(C=0.01, class_weight=None, dual=False,
  fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr',
  n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
  verbose=0, warm_start=False)

> Model: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=65, p=2,
           weights='uniform')

> Model: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='random')

> Model: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=30, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
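For illustration, the tuning call for the RBF SVM might look like the sketch below; the search distributions and n_iter are assumptions, not the values actually used, and X_ohe and y are the preprocessed data from above:

    from scipy.stats import loguniform
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import SVC

    # Illustrative search space; the actual distributions were not given.
    param_dist = {"C": loguniform(1e-2, 1e4), "gamma": loguniform(1e-4, 1e0)}

    search = RandomizedSearchCV(
        SVC(kernel="rbf", probability=True),
        param_distributions=param_dist,
        n_iter=50, cv=10, scoring="f1", random_state=0,
    )
    search.fit(X_ohe, y)
    print("Model:", search.best_estimator_)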

I obtained the following scores with 10-fold CV (mean +/- std):

[image: 10-fold CV train/test scores per model]

To my surprise, the training accuracy and F-scores (especially for the random forest) are higher than the test scores.
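One way to see that gap directly is to have cross-validation report training scores as well, e.g. for the random forest; a sketch reusing X_ohe and y from above:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_validate

    rf = RandomForestClassifier(n_estimators=400, max_depth=30,
                                max_features="sqrt", min_samples_leaf=4)
    res = cross_validate(rf, X_ohe, y, cv=10,
                         scoring=("accuracy", "f1"), return_train_score=True)

    # Train vs. test mean +/- std for each metric; a large, consistent gap
    # is the usual sign of overfitting.
    for m in ("accuracy", "f1"):
        print(f"{m}: train {res['train_' + m].mean():.3f} "
              f"+/- {res['train_' + m].std():.3f}, "
              f"test {res['test_' + m].mean():.3f} "
              f"+/- {res['test_' + m].std():.3f}")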

On top of that, I evaluated bagging algorithms with the same methodology (tuning the number of estimators with RandomizedSearchCV and taking the previous models as the base_estimator); a tuning sketch follows the list:

Model: BaggingClassifier(base_estimator=SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=200, n_jobs=1, oob_score=False,
         random_state=36, verbose=0, warm_start=False)

Model: BaggingClassifier(base_estimator=SVC(C=10000.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=1, oob_score=False,
         random_state=37, verbose=0, warm_start=False)

Model: BaggingClassifier(base_estimator=LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=1, oob_score=False,
         random_state=38, verbose=0, warm_start=False)

Model: BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=65, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=1, oob_score=False,
         random_state=38, verbose=0, warm_start=False)

Model: BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='random'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=500, n_jobs=1, oob_score=False,
         random_state=34, verbose=0, warm_start=False)

Model: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=30, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
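As an illustration, the tuning for one of the bagged models might look like the sketch below; the n_estimators grid is an assumption:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RandomizedSearchCV

    # 'estimator' was called 'base_estimator' in older scikit-learn,
    # as in the printouts above.
    bag = BaggingClassifier(
        estimator=LogisticRegression(C=0.01, solver="liblinear"))

    bag_search = RandomizedSearchCV(
        bag,
        param_distributions={"n_estimators": range(50, 550, 50)},
        n_iter=8, cv=10, scoring="f1", random_state=0,
    )
    bag_search.fit(X_ohe, y)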

I obtained the following scores with 10-fold CV (mean +/- std):

[image: 10-fold CV scores for the bagged models]

To my surprise, there is not much improvement from the single algorithms to the bagged ones.

On the other hand, I also tried AdaBoost with the decision tree and logistic regression (tuning n_estimators and the learning rate with RandomizedSearchCV, as before); a sketch follows the two models:

Model: AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          learning_rate=10, n_estimators=600, random_state=38)

Model: AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='random'),
          learning_rate=0.01, n_estimators=200, random_state=34)
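A sketch of that AdaBoost tuning; the grids are illustrative, not the ones actually searched:

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.tree import DecisionTreeClassifier

    # 'estimator' was called 'base_estimator' in older scikit-learn.
    ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_leaf_nodes=10,
                                         splitter="random"))

    ada_search = RandomizedSearchCV(
        ada,
        param_distributions={"n_estimators": range(100, 700, 100),
                             "learning_rate": [0.01, 0.1, 1, 10]},
        n_iter=15, cv=10, scoring="f1", random_state=0,
    )
    ada_search.fit(X_ohe, y)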

And I obtained the following scores:

[image: 10-fold CV scores for the AdaBoost models]

Reading about ensemble methods, one would expect their accuracy to improve over the single models, but for logistic regression it even got slightly worse, and for the decision tree it did not really improve much (< 1%).

What could be happening? Am I overfitting?

Edit: As @Quickbeam2k1 pointed out, the one-hot encoded groups do not all have the same number of data points. In other words, some groups are so small that they never appear in the test set, so the algorithm cannot generalize well to them. See the example:

[image: example of one-hot encoded groups with very few samples]

Moreover, some features are very unbalanced: in some cases there are only a few samples per value. So I think I need to drop some of the OHE features. The question is, what threshold should I use for dropping them? I am considering dropping the features whose share falls more than 10 percentage points below an equal split of the feature's values. That is, if a feature, say chest pain, takes 4 values, there are cp_1, cp_2, cp_3, cp_4, and an equal split gives each value 25%; the cutoff is then 25% - 10% = 15%, so any OHE value holding fewer than 15% of the samples of that feature gets dropped. Is this a good approach? (A sketch of this rule follows.)
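A minimal sketch of that thresholding rule, assuming a DataFrame X_cat holding the binned categorical features before one-hot encoding (X_cat and the column-naming scheme are hypothetical):

    import pandas as pd

    def rare_ohe_columns(X_cat: pd.DataFrame, margin: float = 0.10) -> list:
        """Return OHE column names whose category frequency falls more than
        `margin` below an equal split of that feature's observed values."""
        drop = []
        for col in X_cat.columns:
            freqs = X_cat[col].value_counts(normalize=True)
            thresh = 1.0 / len(freqs) - margin  # e.g. 1/4 - 0.10 = 0.15
            # Assumes pandas.get_dummies-style names like "cp_1", "cp_2", ...
            drop += [f"{col}_{v}" for v, f in freqs.items() if f < thresh]
        return drop

    # Usage: drop the flagged columns from the one-hot encoded frame.
    # X_ohe_df = X_ohe_df.drop(columns=rare_ohe_columns(X_cat))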

Also, as @HarisNadeem said, bagging algorithms are designed to reduce the variance of the dataset. Since I one-hot encoded the features, I had already significantly reduced that variance, so bagging did not really improve the algorithms.

0 Answers:

No answers yet.