Question

I have a pandas DataFrame, and I want to display 2-gram frequencies based on a text column.

text_column
This is a book
This is a book that is read
This is a book but he doesn't think this is a book

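The final result is a frequency count of 2-grams, but the frequency should count whether a 2-gram is present in each document, not how many times it occurs.

So part of the result would be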

2 gram         Count
This is          3
a book           3

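Both "This is" and "a book" appear in all 3 texts. Although the third text contains each of them twice, I only want to know how many documents each 2-gram occurs in, so the count is 3, not 4.

Any idea how I can do this?

Thanks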

Answer 1

A Pythonic answer (written generically, so it can be applied to a file, a DataFrame, or anything else):

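import collections

c = collections.Counter()
for i in fh:  # fh: any iterable of text lines, e.g. an open file or df["text_column"]
    x = i.rstrip().split(" ")
    # set() keeps each 2-gram at most once per line
    c.update(set(zip(x[:-1], x[1:])))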

Now c holds, for each 2-gram, the number of lines (documents) that contain it.

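Explanation:

Each line is split on spaces.
zip() then returns an iterator of tuples of length 2 (the 2-grams).
The iterator is passed to set() to remove duplicates.
The set is then fed to a collections.Counter() object, which tracks how many times each tuple has been seen. You need to import collections to use it.
From there it is easy to list the counter's contents or convert them to any other format you like (e.g. a DataFrame).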
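
As a quick check, here is a minimal sketch that runs the same steps on the three example texts from the question. The list name lines is only a stand-in for fh and is not part of the original answer.

```python
import collections

# The three example texts from the question, standing in for fh
lines = [
    "This is a book",
    "This is a book that is read",
    "This is a book but he doesn't think this is a book",
]

c = collections.Counter()
for line in lines:
    x = line.rstrip().split(" ")
    # set() drops repeats, so each 2-gram counts once per line
    c.update(set(zip(x[:-1], x[1:])))

# List the counter's contents, most frequent 2-grams first
for (first, second), count in c.most_common():
    print(first, second, count)
```

Note that the split is case-sensitive, so ("This", "is") and ("this", "is") are counted as different 2-grams. Lowercasing each line first would merge them, if that is what you want.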

Yes, Python is great.

Answer 2

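This is very C-style, but it works. The idea is to track the "current" bigrams for each document, making sure each bigram is added only once per document (cur_bigrams = set()), and then, after each document, increment the global frequency counter (bigram_freq) for every bigram found in that document. A new DataFrame is then built from the information in bigram_freq (the global counter across documents).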

import pandas as pd

# df is the DataFrame from the question, with a "text_column" column
bigram_freq = {}
for doc in df["text_column"]:
    cur_bigrams = set()
    words = doc.split(" ")
    bigrams = zip(words, words[1:])
    for bigram in bigrams:
        if bigram not in cur_bigrams: # Add bigram, but only once/doc
            cur_bigrams.add(bigram)
    for bigram in cur_bigrams:
        if bigram in bigram_freq:
            bigram_freq[bigram] += 1
        else:
            bigram_freq[bigram] = 1

result_df = pd.DataFrame(columns=["2_gram", "count"])
row_list = []
for bigram, freq in bigram_freq.items():
    row_list.append([bigram[0] + " " + bigram[1], freq])
for i in range(len(row_list)):
    result_df.loc[i] = row_list[i]

print(result_df)

Output:

           2_gram count
0          a book     3
1            is a     3
2         This is     3
3         is read     1
4         that is     1
5       book that     1
6      he doesn't     1
7         this is     1
8        book but     1
9          but he     1
10     think this     1
11  doesn't think     1

You can use a more functional style and/or list comprehensions to tighten the code up a bit. I leave that as an exercise for the reader.
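
One possible way to do that exercise is sketched below. It is only an illustration, not the answerer's own solution: the function name doc_bigram_counts and the variables docs and rows are made up for this example, and it uses split() with no argument (which also collapses repeated whitespace) rather than split(" ").

```python
from collections import Counter
from itertools import chain


def doc_bigram_counts(docs):
    """Count, for each 2-gram, how many documents contain it at least once."""
    # One set of 2-grams per document, so repeats inside a document collapse
    per_doc_sets = (
        set(zip(words, words[1:]))
        for words in (doc.split() for doc in docs)
    )
    # Counter sees each 2-gram once per document that contains it
    return Counter(chain.from_iterable(per_doc_sets))


docs = [
    "This is a book",
    "This is a book that is read",
    "This is a book but he doesn't think this is a book",
]

counts = doc_bigram_counts(docs)

# Rows sorted by descending count, then alphabetically
rows = sorted(
    ((" ".join(bigram), n) for bigram, n in counts.items()),
    key=lambda row: (-row[1], row[0]),
)
for text, n in rows:
    print(f"{text:<15}{n}")
```

If pandas is available, pd.DataFrame(rows, columns=["2_gram", "count"]) should give a frame with the same columns as result_df above, and passing df["text_column"] as docs applies the function to the question's DataFrame.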

Python N-gram frequency count
