Question

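I am using sklearn.feature_selection.chi2 for feature selection and got some unexpected results (see the code below). Does anyone know the reason, or can anyone point me to some documentation or a pull request?

I compare the results I get against the results I expected, which I obtained both by hand and with scipy.stats.chi2_contingency.

Code: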

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2, SelectKBest

x = np.array([[1, 1, 1, 0, 1], [1, 0, 1, 0, 0], [0, 0, 1, 1, 1], [0, 0, 1, 1, 0], [0, 0, 0, 1, 1], [0, 0, 0, 1, 0]])
y = np.array([1, 1, 2, 2, 3, 3])

scores = []
for i in range(x.shape[1]):
    result = chi2_contingency(pd.crosstab(x[:, i], y))
    scores.append(result[0])

sel = SelectKBest(score_func=chi2, k=3)
sel.fit(x, y)

print(scores)
print(sel.scores_)
print(sel.get_support())

The results are:

[6., 2.4, 6.0, 6.0, 0.0] (Expected)
[4. 2. 2. 2. 0.] (Unexpected)
[ True  True False  True False]

With scipy, features 0, 2 and 3 are kept, whereas with sklearn, features 0, 1 and 3 are kept.

Answer 1

First, in the scipy-based computation the two crosstab arguments (observed and expected) are swapped; it should be:

scores = []
for i in range(x.shape[1]):
    result = chi2_contingency(pd.crosstab(y, x[:, i]))
    scores.append(result[0])

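So the results are now as follows (up to floating-point rounding they match the earlier values, since transposing a contingency table does not change its chi-square statistic):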

[6.000000000000001, 2.4000000000000004, 6.000000000000001, 6.000000000000001, 0.0]

And the results from sklearn's chi2 are:

[4. 2. 2. 2. 0.]

I then looked at the source code, and the two libraries compute the chi-square value a little differently.

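sklearn implementation: you can first check line 171, where the chi2 function is defined. This is the implementation in sklearn, which then passes the observed and expected values on to the _chisquare function.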
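To make the sklearn side concrete, here is a minimal NumPy sketch of what I understand that code path to do: it treats each feature value as a count, sums it per class to get the observed values, and derives the expected values from the class frequencies. The helper name count_based_chi2 is hypothetical and this is not sklearn's actual source, only an approximation that reproduces the numbers above.

```python
import numpy as np

x = np.array([[1, 1, 1, 0, 1], [1, 0, 1, 0, 0], [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0], [0, 0, 0, 1, 1], [0, 0, 0, 1, 0]])
y = np.array([1, 1, 2, 2, 3, 3])


def count_based_chi2(x, y):
    """Hypothetical helper: chi-square with feature values treated as counts."""
    classes = np.unique(y)
    # One-hot encode the labels: shape (n_samples, n_classes).
    onehot = (y[:, None] == classes[None, :]).astype(float)
    # Observed: per-class sum of each feature, shape (n_classes, n_features).
    observed = onehot.T @ x
    # Expected: class frequency times the overall total of each feature.
    class_prob = onehot.mean(axis=0)
    feature_total = x.sum(axis=0)
    expected = np.outer(class_prob, feature_total)
    # Sum the Pearson terms over classes, giving one score per feature.
    return ((observed - expected) ** 2 / expected).sum(axis=0)


sklearn_like = count_based_chi2(x, y)
print(sklearn_like)
```

Note that only the rows where a feature equals 1 contribute to the observed counts here; the rows where it equals 0 are never tabulated, which is why the scores differ from a full contingency-table test.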

scipy implementation: you can look at the scipy implementation here, which in turn calls this function to compute the final chi-square value.
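For comparison, the contingency-table calculation can be sketched by hand as below. This is an assumption-laden illustration, not scipy's source: the helper name contingency_chi2 is hypothetical, and it ignores the continuity correction, which as far as I know only applies to tables with one degree of freedom and so should not matter for this data.

```python
import numpy as np

x = np.array([[1, 1, 1, 0, 1], [1, 0, 1, 0, 0], [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0], [0, 0, 0, 1, 1], [0, 0, 0, 1, 0]])
y = np.array([1, 1, 2, 2, 3, 3])


def contingency_chi2(feature, y):
    """Hypothetical helper: Pearson chi-square on the full value-by-class table."""
    values, classes = np.unique(feature), np.unique(y)
    # Count samples for every (feature value, class) pair, zeros included.
    table = np.array([[np.sum((feature == v) & (y == c)) for c in classes]
                      for v in values], dtype=float)
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()


contingency_scores = np.array(
    [contingency_chi2(x[:, i], y) for i in range(x.shape[1])]
)
print(contingency_scores)
```

Because both the 0 row and the 1 row of each feature enter the table, every feature contributes more cells than in the count-based variant, which is consistent with the larger scores reported for the scipy route.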

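As the implementations show, the values differ because each library builds its observed and expected values differently before computing the chi-square value.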

References:

chi square feature selection using scipy
