Remove low-frequency words

Time: 2018-05-11 17:53:58

Tags: python pandas dataframe text replace

I have a dataframe with 2 columns, one of which contains strings of words, for example:

       Col1                 Col2
0       1          how to remove this word
1       5          how to remove the  word

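I want to remove every word that appears only once in the entire dataframe (threshold = 1). Being able to specify the threshold would be even better. The result would be, for example: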

       Col1                 Col2
0       1          how to remove word
1       5          how to remove word

Any suggestions? Thanks!

1 answer:

Answer 0 (score: 7)

Let's try using Counter:

  1. Split the sentences into words
  2. Compute the global word frequency
  3. Filter the words based on the computed frequency
  4. Join and re-assign

    from collections import Counter
    from itertools import chain
    
    # split words into lists
    v = df['Col2'].str.split().tolist() # [s.split() for s in df['Col2'].tolist()]
    # compute global word frequency
    c = Counter(chain.from_iterable(v))
    # filter, join, and re-assign
    df['Col2'] = [' '.join([j for j in i if c[j] > 1]) for i in v]
    

    df
       Col1                Col2
    0     1  how to remove word
    1     5  how to remove word
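
Since the question also asks for a configurable threshold, the same four steps can be wrapped in a small helper. This is only a sketch: the name `drop_rare_words` and its `threshold` parameter are made up for illustration and are not part of pandas or of the answer above. It assumes every entry is a plain string with no missing values, and it works on any iterable of strings so that it can be tried without a dataframe.

```python
from collections import Counter
from itertools import chain


def drop_rare_words(sentences, threshold=1):
    """Hypothetical helper: drop words seen `threshold` times or fewer overall."""
    # split every sentence into a list of whitespace-separated words
    tokens = [s.split() for s in sentences]
    # count each word across all sentences
    counts = Counter(chain.from_iterable(tokens))
    # keep only words seen more than `threshold` times, then re-join with single spaces
    return [' '.join(w for w in row if counts[w] > threshold) for row in tokens]


sentences = ['how to remove this word', 'how to remove the  word']
cleaned = drop_rare_words(sentences, threshold=1)
print(cleaned)  # ['how to remove word', 'how to remove word']
```

With the dataframe from the question, this could be applied as `df['Col2'] = drop_rare_words(df['Col2'], threshold=1)`. Raising `threshold` removes more words; with `threshold=2` every word in this two-row example is dropped, leaving empty strings.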