Naive Bayes gives wrong probabilities in Python

Time: 2018-08-29 14:31:51

Tags: python scikit-learn naivebayes

I have two datasets, AdultTest and AdultData. Each contains many rows like this one:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female , 2174, 0, 40, United-States, >50K

I want to compute the probability that a "Female" earns >50K. To do that, I wrote:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Read AdultData.csv and adultTest.csv (already integer-encoded) so that Naive Bayes can be computed
data1 = np.genfromtxt('AdultData.csv', delimiter=',', dtype='int', skip_footer=1)
datatest = np.genfromtxt('adultTest.csv', delimiter=',', dtype='int', skip_footer=1)

# Delete the last column (index 14), because it is the target
data_new = np.delete(data1, 14, 1)
dataTest_new = np.delete(datatest, 14, 1)

# Target labels taken from the training data (originally read from `data2`, which is never defined)
class_ = [row[14] for row in data1]

clf = BernoulliNB()
clf.fit(data_new, class_)
print(clf.predict_proba(dataTest_new))

The output is the predicted probabilities, and I always get:

[1. 0.]

I don't know why. Even when I pass in AdultTest, which contains different data, I get the same result.

Why don't I get different results? Also, why are there 2 columns?

P.S. I am doing this because I want to implement the massaging algorithm for discrimination-free classification.

Can anyone help?

Thanks!

1 answer:

Answer 0: (score: 0)

I think there is a logic error in your code, because you never use dataTest_new:

data_new = np.delete(data1, 14, 1)
dataTest_new = np.delete(datatest, 14, 1)

# take the labels from the training data (`data2` is never defined)
class_ = [row[14] for row in data1]

clf = BernoulliNB()
clf.fit(data_new, class_)
# you should run prediction on the test data
print(clf.predict_proba(dataTest_new))
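
Two further points may help explain the output. First, predict_proba returns one column per class, in the order given by clf.classes_, so with two income classes there are two columns. Second, BernoulliNB binarizes its input with a default threshold of binarize=0.0, so integer-encoded features that are all greater than 0 collapse to the same row of ones. That is one possible reason why every row gets the same probabilities. The sketch below uses a small made-up dataset (not the Adult data) and assumes that one-hot encoding the categories with OneHotEncoder is acceptable for your use case; the names X, y, clf_default and clf_onehot are only for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (NOT the Adult dataset): two integer-encoded
# categorical features, all codes > 0, and a binary target.
X = np.array([[1, 3], [2, 3], [1, 4], [2, 4], [1, 3], [2, 4]])
y = np.array([0, 1, 0, 1, 0, 1])

# With the default binarize=0.0 every value > 0 becomes 1,
# so all rows look identical to the model.
clf_default = BernoulliNB().fit(X, y)
proba_default = clf_default.predict_proba(X)

# One column per class, ordered as in clf.classes_.
print(clf_default.classes_)
print(proba_default)  # every row is the same

# One-hot encode so that each category becomes its own 0/1 feature.
encoder = OneHotEncoder(handle_unknown='ignore')
X_onehot = encoder.fit_transform(X).toarray()

clf_onehot = BernoulliNB().fit(X_onehot, y)
proba_onehot = clf_onehot.predict_proba(X_onehot)
print(proba_onehot)  # rows now differ with the feature values
```

On real data the encoder would be fitted on the training set only and then applied to the test set with encoder.transform, so that both use the same columns.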