Computing the symmetric Kullback-Leibler divergence between two documents

Date: 2016-02-18 13:35:34

Tags: python nlp similarity information-retrieval

I have followed the paper here and the code here (an implementation using the symmetric KLD and the back-off model proposed in the first link) to compute the KLD between two text data sets. I ended up changing the for loop to return the probability distributions of the two data sets, in order to test whether both sum to 1:

import re, math, collections

def tokenize(_str):
    stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is', \
                 'are', 'will', 'in', 'it', 'to', 'that']
    tokens = collections.defaultdict(lambda: 0.)
    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2: continue
        if m in stopwords: continue
        tokens[m] += 1

    return tokens
#end of tokenize

def kldiv(_s, _t):
    if (len(_s) == 0):
        return 1e33

    if (len(_t) == 0):
        return 1e33

    ssum = 0. + sum(_s.values())
    slen = len(_s)

    tsum = 0. + sum(_t.values())
    tlen = len(_t)

    vocabdiff = set(_s.keys()).difference(set(_t.keys()))
    lenvocabdiff = len(vocabdiff)

    """ epsilon """
    epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001

    """ gamma """
    gamma = 1 - lenvocabdiff * epsilon

    """ Check if distribution probabilities sum to 1"""
    sc = sum([v/ssum for v in _s.itervalues()])
    st = sum([v/tsum for v in _t.itervalues()])

    ps=[] 
    pt = [] 
    for t, v in _s.iteritems(): 
        pts = v / ssum 
        ptt = epsilon 
        if t in _t: 
            ptt = gamma * (_t[t] / tsum) 
        ps.append(pts) 
        pt.append(ptt)
    return ps, pt

I tested it with:

d1 = """Many research publications want you to use BibTeX, which better organizes the whole process. Suppose for concreteness your source file is x.tex. Basically, you create a file x.bib containing the bibliography, and run bibtex on that file.""" d2 = """In this case you must supply both a \left and a \right because the delimiter height are made to match whatever is contained between the two commands. But, the \left doesn't have to be an actual 'left delimiter', that is you can use '\left)' if there were some reason to do it."""

I found that sum(ps) = 1, but sum(pt) is less than 1, although both should sum to 1.

Is there something incorrect in the code? Thanks!

Update

To make both pt and ps sum to 1, I had to change the code to:

    from collections import Counter  # Counter is not in the imports at the top

    vocab = Counter(_s) + Counter(_t)  # joint vocabulary: every word from either document
    ps=[] 
    pt = [] 
    for t, v in vocab.iteritems(): 
        if t in _s:
            pts = gamma * (_s[t] / ssum) 
        else: 
            pts = epsilon

        if t in _t: 
            ptt = gamma * (_t[t] / tsum) 
        else:
            ptt = epsilon

        ps.append(pts) 
        pt.append(ptt)

    return ps, pt
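With the joint vocabulary, both loops iterate over the same word set, so (assuming the fragment above replaces the tail of kldiv()) re-running the earlier check should print two sums close to 1:

    ps, pt = kldiv(tokenize(d1), tokenize(d2))
    print(sum(ps), sum(pt))  # both should now be close to 1.0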

2 answers:

Answer 0 (score: 2)

sum(ps) and sum(pt) are the total probability masses of _s and _t over the support of s (by "support of s" I mean all the words that appear in _s, regardless of which words appear in _t). This means that:

  1. sum(ps) == 1, since the for loop sums over all the words in _s.
  2. sum(pt) <= 1, and equality holds if the support of t is a subset of the support of s (i.e., if every word in _t also appears in _s). Moreover, sum(pt) can be close to 0 if the overlap between the words in _s and _t is small; specifically, if the intersection of _s and _t is the empty set, then sum(pt) == epsilon * len(_s) (see the toy check after this list).
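A toy check of the disjoint-vocabulary case in point 2, with made-up counts (illustrative numbers, not taken from the question):

    # s and t share no words, so every word of s backs off to epsilon in pt.
    s = {'alpha': 2., 'beta': 2.}    # ssum = 4, min(s)/ssum = 0.5
    t = {'gamma': 1., 'delta': 3.}   # tsum = 4, min(t)/tsum = 0.25
    # epsilon = min(0.5, 0.25) * 0.001 = 0.00025
    # sum(pt) = epsilon * len(s) = 0.00025 * 2 = 0.0005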
So, I don't think there is a problem with the code.

Also, contrary to the title of the question, kldiv() does not compute the symmetric KL divergence; it computes the KL divergence between _s and a smoothed version of _t.
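For completeness, a minimal sketch of a symmetric variant built on top of the smoothed distributions (symmetric_kld is a hypothetical helper, not part of the original code; ps and pt are the parallel probability lists returned by the updated loop):

    import math

    def symmetric_kld(ps, pt):
        # KL(P||Q) + KL(Q||P); the epsilon smoothing guarantees every
        # entry is positive, so the logarithms are always defined.
        kl_pq = sum(p * math.log(p / q) for p, q in zip(ps, pt))
        kl_qp = sum(q * math.log(q / p) for p, q in zip(ps, pt))
        return kl_pq + kl_qp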

Answer 1 (score: 0)

The sum of each document's probability distribution is stored in the variables sc and st; both are close to 1.