Good cross-validation score but a very bad roc_auc score

Asked: 2016-03-25 07:36:17

Tags: machine-learning scikit-learn data-science auc

I'm quite new to this, so any kind of information would be helpful. Apologies if I'm asking a very trivial question. I'm working on a medium-sized dataset that contains a lot of zeros. We have applied a number of models, and the cv-skf score with k = 10 is already above 0.85, but the roc_auc score stays stuck around 0.5. I'm using sklearn. Below are the code snippets.

# Imports assumed by the snippets below (not shown in the original post)
from itertools import combinations

import pandas as pd
import xgboost as xgb
from numpy import array, array_equal
from sklearn import cross_validation as cv, metrics

train_dataset = pd.read_csv('./input/train.csv', index_col='ID')
test_dataset = pd.read_csv('./input/test.csv', index_col='ID')

#print_shapes()
# How many nulls are there in the datasets?
nulls_train = train_dataset.isnull().sum().sum()
nulls_test = test_dataset.isnull().sum().sum()
#print('There are {} nulls in TRAIN and {} nulls in TEST dataset.'.format(nulls_train, nulls_test))
# Remove constant features

def identify_constant_features(dataframe):
    count_uniques = dataframe.apply(lambda x: len(x.unique()))
    constants = count_uniques[count_uniques == 1].index.tolist()
    return constants

constant_features_train = set(identify_constant_features(train_dataset))

#print('There were {} constant features in TRAIN dataset.'.format(len(constant_features_train)))

# Drop the constant features
train_dataset.drop(constant_features_train, inplace=True, axis=1)


#print_shapes()
# Remove equals features

def identify_equal_features(dataframe):
    features_to_compare = list(combinations(dataframe.columns.tolist(),2))
    equal_features = []
    for compare in features_to_compare:
        is_equal = array_equal(dataframe[compare[0]],dataframe[compare[1]])
        if is_equal:
            equal_features.append(list(compare))
    return equal_features

equal_features_train = identify_equal_features(train_dataset)

#print('There were {} pairs of equal features in TRAIN dataset.'.format(len(equal_features_train)))

# Remove the second feature of each pair.

features_to_drop = array(equal_features_train)[:,1] 
train_dataset.drop(features_to_drop, axis=1, inplace=True)

#print_shapes()
# Define the variables model.

y_name = 'TARGET'
feature_names = train_dataset.columns.tolist()
feature_names.remove(y_name)

X = train_dataset[feature_names]
y = train_dataset[y_name]

# Save the features selected for later use.
pd.Series(feature_names).to_csv('features_selected_step1.csv', index=False)
#print('Features selected\n{}'.format(feature_names))


# Proportion of classes
y.value_counts()/len(y)

skf = cv.StratifiedKFold(y, n_folds=10, shuffle=True)
score_metric = 'roc_auc'
scores = {}

def score_model(model):
    return cv.cross_val_score(model, X, y, cv=skf, scoring=score_metric)

clfxgb = xgb.XGBClassifier()
clfxgb = clfxgb.fit(X, y)
probxgb = clfxgb.predict(X)
# print('XGB', np.shape(probxgb))
print(metrics.roc_auc_score(y, probxgb))

Output - Populating the interactive namespace from numpy and matplotlib test.csv train.csv

0.502140359687
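As an aside (not part of the original post), one detail worth checking in the snippet above: `roc_auc_score` expects continuous ranking scores, such as the output of `predict_proba`. Passing it the hard 0/1 labels returned by `predict` on a zero-heavy dataset, where the model mostly predicts the majority class, can pin the AUC near 0.5. A toy sketch:

```python
# Toy illustration (synthetic data, not the post's dataset): AUC computed on
# probabilities vs. on hard labels from predict()-style thresholding.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 0, 1, 1]               # imbalanced, mostly zeros
proba = [0.1, 0.2, 0.3, 0.2, 0.4, 0.45]   # predicted P(class 1)
labels = [1 if p >= 0.5 else 0 for p in proba]  # what predict() returns: all 0 here

print(roc_auc_score(y_true, proba))   # 1.0 - every positive ranked above every negative
print(roc_auc_score(y_true, labels))  # 0.5 - all predictions tied at 0
```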

For cv-skf -

cv.cross_val_score(xgb.XGBClassifier(), X, y, cv=skf, scoring=score_metric)

Output - array([ 0.83124251,  0.84162387,  0.83580491])

We create the submission .csv file as -

test_dataset.drop(constant_features_train, inplace=True, axis=1)
test_dataset.drop(features_to_drop, axis=1, inplace=True)
print(test_dataset.shape)
X_SubTest = test_dataset
df_test = pd.read_csv('./input/test.csv')
id_test = df_test['ID']
predTest = model.predict(X_SubTest)
submission = pd.DataFrame({"ID":id_test, "TARGET":predTest})
submission.to_csv("submission_svm_23-3.csv", index=False)

1 Answer:

Answer 0 (score: 0)

You are not using the cross-validation information to train your model: roc_auc and the cross-validation score mean very different things. To get a higher ROC score you need to do model selection, that is, pick the model (with the best parameters) that has the highest cross-validation score. One way to do this is to use something like GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) to search the space of candidate models over different parameters of your XGBoost model. That way you will have selected your model specifically because it has a high cross-validation roc_auc.
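The selection loop described above can be sketched as follows. This is a minimal illustration on synthetic data: LogisticRegression stands in for XGBoost (so the snippet runs without xgboost installed), and it uses the modern `sklearn.model_selection` import path; the linked `sklearn.grid_search` module is the older location of the same class.

```python
# Sketch of cross-validated model selection with roc_auc as the criterion.
# LogisticRegression and this parameter grid are illustrative stand-ins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Imbalanced synthetic data, loosely mimicking a zero-heavy dataset
X_demo, y_demo = make_classification(n_samples=200, weights=[0.9, 0.1],
                                     random_state=0)

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",                      # select on cross-validated AUC
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

# The chosen parameters maximize the cross-validated roc_auc
print(search.best_params_, round(search.best_score_, 3))
```

With XGBoost installed, the same pattern applies by swapping in `xgb.XGBClassifier()` and a grid over its parameters (e.g. `max_depth`, `learning_rate`, `n_estimators`).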

Here is a detailed example from Kaggle: https://www.kaggle.com/tanitter/introducing-kaggle-scripts/grid-search-xgboost-with-scikit-learn/run/23363
