How do I correct my Naive Bayes method returning extremely small conditional probabilities?

Asked: 2016-05-07 21:19:35

Tags: python algorithm computer-science data-retrieval

I am trying to compute the probability that an email is spam with Naive Bayes. I have a Document class (provided by the website) to create documents, and another class for training and classifying documents. My train function counts all the unique terms across all documents, all documents in the spam class, and all documents in the ham class, and computes the prior probabilities (one for spam, one for ham). Then I store each term's conditional probability into a dict using the following formula:

condprob = (Tct + 1) / (Tct' + B')

Tct  = # occurrences of the term in the given class
Tct' = # terms in the given class
B'   = # unique terms across all documents

classes = spam or ham
spam = spam, ham = not spam

The problem is that when I use this formula in my code, it gives me extremely small conditional probability scores, such as 2.461114392596968e-05. I am fairly sure this is because the values of Tct are very small (like 5 or 8) compared to the denominator values of Tct' (64878 for ham and 308930 for spam) and B' (which is 16386). I can't figure out how to get the condprob scores up to something like .00034155, and I can only assume my condprob scores are not supposed to be as exponentially small as they are. Am I calculating them wrong, or are the values actually supposed to be this small? If it helps, my goal is to score a set of test documents and get results like 327.82, 758.80, or 138.66 using this formula:
score function
However, with my small condprob values I only get negative numbers.
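For reference, the add-one smoothing formula above can be sketched directly; the counts below are toy values chosen to match the scale described in the question (Tct around 5, a ham total of 64878 tokens, 16386 unique terms):

```python
# Minimal sketch of the add-one (Laplace) smoothed conditional probability.
def condprob(tct, class_token_total, vocab_size):
    """P(term | class) = (Tct + 1) / (Tct' + B')."""
    return (tct + 1) / (class_token_total + vocab_size)

# With counts on the scale described in the question, values around
# 1e-05 are exactly what the formula produces:
p = condprob(5, 64878, 16386)
print(p)  # roughly 7.4e-05
```

Magnitudes in the 1e-05 range fall straight out of the arithmetic whenever the class token total and vocabulary size dwarf the per-term count.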

Code

-Document

from collections import defaultdict
import glob
import math
import os


class Document(object):
    """
    The instance variables are:
    filename....The path of the file for this document.
    label.......The true class label ('spam' or 'ham'), determined by whether the filename contains the string 'spmsg'
    tokens......A list of token strings.
    """

    def __init__(self, filename=None, label=None, tokens=None):
        """ Initialize a document either from a file, in which case the label
        comes from the file name, or from specified label and tokens, but not
        both.
        """
        if label:  # specify from label/tokens, for testing.
            self.label = label
            self.tokens = tokens
        else:  # specify from file.
            self.filename = filename
            self.label = 'spam' if 'spmsg' in filename else 'ham'
            self.tokenize()

    def tokenize(self):
        self.tokens = ' '.join(open(self.filename).readlines()).split()

-NaiveBayes

class NaiveBayes(object):
    def train(self, documents):
        """
        Given a list of labeled Document objects, compute the class priors and
        word conditional probabilities, following Figure 13.2 of your
        book. Store these as instance variables, to be used by the classify
        method subsequently.
        Params:
          documents...A list of training Documents.
        Returns:
          Nothing.
        """
        ###TODO
        unique = []
        proxy2 = []   # all tokens from ham documents
        proxy3 = []   # all tokens from spam documents
        condprob = [{}, {}]
        Tct = defaultdict(int)
        Tc_t = defaultdict(int)
        prior = {}
        count = 0     # number of ham documents
        oldterms = []
        old_terms = []
        for a in range(len(documents)):
            done = False
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
                if documents[a].label == "ham":
                    proxy2.append(item)
                    if done == False:
                        count += 1
                elif documents[a].label == "spam":
                    proxy3.append(item)
                done = True
        V = unique
        N = len(documents)
        print("N:", N)
        LB = len(unique)
        print("THIS IS LB:", LB)
        self.V = V
        print("THIS IS COUNT/NC", count)
        Nc = count
        prior["ham"] = Nc / N
        Nc = len(documents) - count
        print("THIS IS SPAM COUNT/NC", Nc)
        prior["spam"] = Nc / N
        self.prior = prior
        text2 = proxy2
        text3 = proxy3
        TctTotal = len(text2)
        Tc_tTotal = len(text3)
        print("THIS IS TCTOTAL", TctTotal)
        print("THIS IS TC_TTOTAL", Tc_tTotal)
        for term in text2:
            if term not in oldterms:
                Tct[term] = text2.count(term)
                oldterms.append(term)
        for term in text3:
            if term not in old_terms:
                Tc_t[term] = text3.count(term)
                old_terms.append(term)
        for term in V:
            if term in text2:
                condprob[0].update({term: (Tct[term] + 1) / (TctTotal + LB)})
            if term in text3:
                condprob[1].update({term: (Tc_t[term] + 1) / (Tc_tTotal + LB)})
        print("This is condprob", condprob)
        self.condprob = condprob

    def classify(self, documents):
        """ Return a list of strings, either 'spam' or 'ham', for each document.
        Params:
          documents....A list of Document objects to be classified.
        Returns:
          A list of label strings corresponding to the predictions for each document.
        """
        ###TODO
        # condprob[0] is ham, condprob[1] is spam
        unique = []
        ans = []
        for a in range(len(documents)):
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
            W = unique
            hscore = math.log(float(self.prior['ham']))
            sscore = math.log(float(self.prior['spam']))
            for t in W:
                try:
                    hscore += math.log(self.condprob[0][t])
                except KeyError:
                    pass  # 'continue' here would skip the spam update below
                try:
                    sscore += math.log(self.condprob[1][t])
                except KeyError:
                    pass
            print("THIS IS SSCORE", sscore)
            print("THIS IS HSCORE", hscore)
            unique = []
            # the higher score wins: hscore belongs to 'ham', sscore to 'spam'
            if hscore > sscore:
                label = 'ham'
            else:
                label = 'spam'
            ans.append(label)

        return ans

-Test

if not os.path.exists('train'):  # download data
    from urllib.request import urlretrieve
    import tarfile

    urlretrieve('http://cs.iit.edu/~culotta/cs429/lingspam.tgz', 'lingspam.tgz')
    tar = tarfile.open('lingspam.tgz')
    tar.extractall()
    tar.close()

train_docs = [Document(filename=f) for f in glob.glob("train/*.txt")]
test_docs = [Document(filename=f) for f in glob.glob("test/*.txt")]
test = train_docs

nb = NaiveBayes()
nb.train(train_docs[1500:])
#uncomment when testing classify()
#predictions = nb.classify(test_docs[:200])
#print("PREDICTIONS",predictions)

The end goal is to be able to classify documents as spam or ham, but I want to work out the conditional probability problem first.

Question
Are the conditional probability values supposed to be this small? If so, why am I getting strange scores through classify? If not, how do I fix my code to give me the correct condprob values?


The current condprob values I am getting look like this:

'traditional': 2.461114392596968e-05, 'fillmore': 2.461114392596968e-05, '796': 2.461114392596968e-05, 'zann': 2.461114392596968e-05

condprob is a list containing two dictionaries, the first for ham and the second for spam. Each dictionary maps a term to its conditional probability. I want "normal" small values such as .00031235 rather than 3.1235e-05. The reason is that when I run the condprob values through the classify method with some test documents, I get scores like

THIS IS HSCORE -2634.5292392650663, THIS IS SSCORE -1707.983339196181

when they should look like THIS IS HSCORE 327.82, THIS IS SSCORE 758.80

Runtime

~1 minute, 30 seconds

1 Answer:

Answer 0: (score: 1)

(You seem to be working with log probabilities, which is very sensible, but I am going to write most of what follows in terms of raw probabilities, which you can get by exponentiating the log probabilities, because it keeps the algebra easier, even though in practice it means you will probably get numerical underflow if you don't use logs.)

As far as I can tell, you start with the prior probabilities p(Ham) and p(Spam), and then use probabilities estimated from the training data to work out p(Ham) * p(Observations | Ham) and p(Spam) * p(Observations | Spam).

Bayes' theorem rearranges p(Obs | Spam) = p(Obs & Spam) / p(Spam) = p(Obs) p(Spam | Obs) / p(Spam) to give you p(Spam | Obs) = p(Spam) p(Obs | Spam) / p(Obs). You seem to have computed p(Spam) p(Obs | Spam) = p(Obs & Spam) but not divided by p(Obs). Since Ham and Spam are the only two possibilities, the easiest thing to do is to note that p(Obs) = p(Obs & Spam) + p(Obs & Ham), so just divide each of your two computed values by their sum, essentially rescaling the values so that they really do sum to 1.0.
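As a toy numeric sketch of that normalization (the two joint probabilities below are made up for illustration; only their ratio matters):

```python
# Made-up joint probabilities p(Obs & class) for illustration.
p_obs_and_spam = 0.00012   # p(Spam) * p(Obs | Spam)
p_obs_and_ham = 0.00003    # p(Ham) * p(Obs | Ham)

# p(Obs) = p(Obs & Spam) + p(Obs & Ham), so divide each joint value by the sum:
p_obs = p_obs_and_spam + p_obs_and_ham
p_spam_given_obs = p_obs_and_spam / p_obs
p_ham_given_obs = p_obs_and_ham / p_obs
print(p_spam_given_obs, p_ham_given_obs)  # roughly 0.8 and 0.2
```

However small the raw joint values are, the rescaled posteriors always sum to one.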

This rescaling is trickier if you start with log probabilities lA and lB. To scale these, I would first bring them into a reasonable range while they are still logs, by subtracting off the larger of the two:

m = max(lA, lB)

lA = lA - m

lB = lB - m

Now at least the larger of the two will not overflow. The smaller still might underflow, but I would rather deal with underflow than overflow. Now turn them into not-quite-correctly-scaled probabilities:

pA = exp(lA)

pB = exp(lB)

and scale them properly so that they sum to one:

truePA = pA / (pA + pB)

truePB = pB / (pA + pB)
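Putting those rescaling steps together as a small sketch (fed with log scores on the scale reported in the question):

```python
import math

# Sketch of the rescaling above: subtract the larger log score before
# exponentiating so the bigger term cannot overflow, then normalize so
# the two resulting probabilities sum to 1.
def normalize_log_scores(lA, lB):
    m = max(lA, lB)
    pA = math.exp(lA - m)
    pB = math.exp(lB - m)
    total = pA + pB
    return pA / total, pB / total

# With scores like those in the question, the spam score dominates
# completely (the ham term underflows harmlessly to 0.0):
truePA, truePB = normalize_log_scores(-2634.53, -1707.98)
print(truePA, truePB)  # 0.0 1.0
```

Note that large negative log scores like -2634.53 are perfectly normal for long documents; what matters is the difference between the two scores, not their absolute size.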
