How to generate unigrams, bigrams, and trigrams from a large CSV file and count their frequencies using NLTK or pure Python

Asked: 2018-05-23 09:12:14

Tags: python-2.7 nltk n-gram

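I used this code to generate unigrams, bigrams, and trigrams from a given text. Now I want to extract unigrams, bigrams, and trigrams from a specific column of a large CSV file. How should I proceed?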

1 Answer:

Answer 0 (score: 0)

First, some fancy code to generate a DataFrame.

from io import StringIO

import pandas as pd

sio = StringIO("""I am just going to type up something because you inserted an image instead ctr+c and ctr+v the code to Stackoverflow.
Actually, it's unclear what you want to do with the ngram counts.
Perhaps, it might be better to use the `nltk.everygrams()` if you want a global count.
And if you're going to build some sort of ngram language model, then it might not be efficient to do it as you have done too.""")

with sio as fin:
    texts = [line for line in fin]

df = pd.DataFrame({'text': texts})

Then you can easily use DataFrame.apply to extract the n-grams, for example:

from collections import Counter
from functools import partial

from nltk import ngrams, word_tokenize

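# Add one column per n-gram order (1 = unigrams, 2 = bigrams, 3 = trigrams);
# each cell holds a Counter of the n-grams found in that row's text.
for i in range(1, 4):
    _ngrams = partial(ngrams, n=i)
    df['{}-grams'.format(i)] = df['text'].apply(lambda x: Counter(_ngrams(word_tokenize(x))))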
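
If the goal is a single frequency table for the whole file rather than one Counter per row, a pure-Python alternative is to stream the CSV row by row and update global counters, so the file never has to fit in memory. The following is only a minimal sketch under several assumptions: Python 3, a header row, a column named `text` (a hypothetical name), and naive lowercase whitespace tokenization in place of `nltk.word_tokenize`. The helper names `iter_ngrams` and `count_ngrams` are made up for illustration.

```python
import csv
import io
from collections import Counter


def iter_ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


def count_ngrams(rows, column, max_n=3):
    """Return {n: Counter} for n = 1..max_n over one column of dict rows."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for row in rows:
        # Assumption: lowercase + whitespace split is good enough here;
        # swap in nltk.word_tokenize if punctuation handling matters.
        tokens = (row.get(column) or "").lower().split()
        for n in counts:
            counts[n].update(iter_ngrams(tokens, n))
    return counts


# Hypothetical in-memory stand-in for a large CSV file.
sample = io.StringIO(
    "id,text\n"
    "1,the cat sat on the mat\n"
    "2,the cat ate\n"
)

# csv.DictReader reads lazily, so only one row is held in memory at a time.
ngram_counts = count_ngrams(csv.DictReader(sample), column="text")

print(ngram_counts[1].most_common(1))   # most frequent unigram
print(ngram_counts[2][("the", "cat")])  # frequency of one bigram
```

For a real file, the `io.StringIO` object would be replaced by something like `open(path, newline="", encoding="utf-8")` inside a `with` block, where `path` and the encoding are again assumptions about your data.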