Question

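I am writing a Java program that computes the frequencies of monoGrams and biGrams in a list of Strings. Extracting the nGram strings from a string is not the problem; I am having trouble computing the "exact" frequencies. By that I mean: if a biGram contains two monoGrams, I want to subtract the biGram's frequency from the frequency of each of those two monoGrams.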

Example

I want to compute the frequencies in this string:

golden credit card

I count the monoGrams:

"golden" with freq 1
"credit" with freq 1
"card" with freq 1

and the biGrams:

"golden credit" with freq 1
"credit card" with freq 1

Now I compute the "exact" frequencies:

"golden credit" contains "golden" and "credit", so I subtract the frequency of "golden credit" from the frequencies of those two words:

goldenFreq -= 1
creditFreq -= 1

The same goes for the other biGram, "credit card":

creditFreq -= 1
cardFreq -= 1

Now you can see the resulting monoGrams:

"golden" has freq 0
"credit" has freq -1
"card" has freq 0

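This is the real problem! I don't want a word that is contained in two consecutive biGrams ("credit" in this case) to be subtracted twice, so that its count does not drop below zero (or below where it should be).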
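For reference, the post-processing described above can be reproduced with a minimal sketch like the one below. The class and method names (`NaiveNGramCount`, `naiveExactCounts`) are made up for illustration, and the sketch assumes that words are separated by whitespace and that every pair of adjacent words forms a biGram.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reproduction of the naive approach: count monoGrams first,
// then subtract 1 from both words of every biGram occurrence.
class NaiveNGramCount {
    static Map<String, Integer> naiveExactCounts(String text) {
        String[] words = text.trim().split("\\s+");
        Map<String, Integer> mono = new LinkedHashMap<>();

        // Raw monoGram counts.
        for (String w : words) {
            mono.merge(w, 1, Integer::sum);
        }

        // Assumption: every adjacent pair is a biGram.
        for (int i = 0; i + 1 < words.length; i++) {
            mono.merge(words[i], -1, Integer::sum);
            mono.merge(words[i + 1], -1, Integer::sum);
        }
        return mono;
    }

    public static void main(String[] args) {
        // prints {golden=0, credit=-1, card=0}
        System.out.println(naiveExactCounts("golden credit card"));
    }
}
```

The middle word is decremented once for each of the two biGrams it belongs to, which is where the -1 comes from.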

Answer 1

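It seems that the "exact" frequency is the number of times a word occurs as a monogram without also being part of a bigram. If you look closely, you should subtract only 1 from the monogram's frequency for each occurrence of the word that appears in any bigram, not 1 for every bigram that occurrence appears in. Your example shows why: the word is counted once as a monogram but occurs in two bigrams.

To fix this, you need extra logic during accumulation rather than in post-processing. Once you have only the raw monogram and bigram counts, you can no longer tell how many overlapping bigrams (0, 1, or 2) each occurrence of a word belongs to, so you have to check while reading the text. One way to do it is as follows: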

for each word in the text:
    add to raw monogram count for word
    bigramFound = false
    if this word and previous word make a bigram:
        (bigram was already counted when the previous word was processed)
        bigramFound = true
    if this word and next word make a bigram:
        increment count for bigram
        bigramFound = true
    if not bigramFound:
        increment "exact" monogram count for word
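The steps above could look roughly like this in Java. This is only a sketch under a few assumptions: the class name `ExactNGramCounter` and the `isBigram` predicate are hypothetical, words are split on whitespace, and each bigram occurrence is counted once, at its first word, so that it is not incremented a second time when its second word is processed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical single-pass counter for raw monograms, bigrams, and "exact" monograms.
class ExactNGramCounter {
    final Map<String, Integer> rawMono = new LinkedHashMap<>();
    final Map<String, Integer> exactMono = new LinkedHashMap<>();
    final Map<String, Integer> bigrams = new LinkedHashMap<>();

    // isBigram is an assumed hook that decides whether two adjacent words form a bigram.
    void count(String text, BiPredicate<String, String> isBigram) {
        String[] words = text.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            rawMono.merge(word, 1, Integer::sum);
            // Make sure every word shows up in the result, even with an exact count of 0.
            exactMono.putIfAbsent(word, 0);

            boolean withPrev = i > 0 && isBigram.test(words[i - 1], word);
            boolean withNext = i + 1 < words.length && isBigram.test(word, words[i + 1]);

            if (withNext) {
                // Count the bigram only here, at its first word.
                bigrams.merge(word + " " + words[i + 1], 1, Integer::sum);
            }
            if (!withPrev && !withNext) {
                // This occurrence is not part of any bigram.
                exactMono.merge(word, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        ExactNGramCounter c = new ExactNGramCounter();
        c.count("golden credit card", (a, b) -> true); // treat every adjacent pair as a bigram
        System.out.println(c.exactMono); // {golden=0, credit=0, card=0}
        System.out.println(c.bigrams);   // {golden credit=1, credit card=1}
    }
}
```

With the question's input, "credit" ends up at 0 instead of -1, because the decision is made per occurrence rather than per bigram. If only some pairs should count as bigrams, a stricter predicate can be passed, for example `(a, b) -> (a + " " + b).equals("credit card")`, in which case "golden" keeps an exact count of 1.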

Computing nGram frequencies
