Question

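I want to take a Series of words and their frequencies, drop the entries that are common stop words, and then write the result to a .txt file: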

f= open('frequ_words.txt', 'w+')

frequ_words = pd.Series(' '.join(df['message']).lower().split()).value_counts()[:500]

stop_words = get_stop_words('de')

for i in stop_words:
        try:
            frequ_words.drop(i)
        except:
            pass

f.write(str(frequ_words))

f.close()

I also tried other looping approaches, such as:

for i in frequ_words:
    if i in stop_words:
        pass
    else:
        f.write(frequ_words)

f.close()

But I can't get it to work. Any suggestions?

Edit:

The Series data looks like this:

word1     89086
word2     85946
...
word500    1098

Answer 1

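If you have a Series of word frequencies whose index is the words themselves, you can filter out stop words with a single pandas expression: words = words[words.index.values != stop_words.values]. Note that this elementwise comparison relies on NumPy broadcasting, so it only works as intended when stop_words holds a single value, as in the example here; for a longer stop word list, words = words[~words.index.isin(stop_words)] is the general form.

Here is an example using a Series that looks like the one you pasted above: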

words = pd.Series(data = [89086, 85946, 1098], index = ['word1', 'word2', 'word500'])

word1      89086
word2      85946
word500     1098
dtype: int64

Then, if you have another Series containing the stop words as its values:

stop_words = pd.Series(data=['word2'])

0    word2
dtype: object

To filter the word frequency Series so that it excludes the stop words, run this line:

words = words[words.index.values != stop_words.values]

This outputs your original word frequency Series with the stop words removed:

word1      89086
word500     1098
dtype: int64
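
For a stop word list with more than one entry, a minimal sketch along the same lines might look like the following. The stop word list here is made up for illustration (in the question it would come from get_stop_words('de')), and the output file name is taken from the question. It also shows why the loop in the question has no effect: Series.drop returns a new Series instead of modifying the original, so its result has to be assigned.

```python
import pandas as pd

# Example frequencies, indexed by word (same shape as the Series in the question).
words = pd.Series(data=[89086, 85946, 1098], index=['word1', 'word2', 'word500'])

# Hypothetical stop word list; some entries may not occur in the Series at all.
stop_words = ['word2', 'word500', 'not_in_series']

# Keep only the entries whose index label is not a stop word.
filtered = words[~words.index.isin(stop_words)]

# Alternative: drop() returns a new Series, so the result must be assigned.
# errors='ignore' skips labels that are not present in the index.
dropped = words.drop(stop_words, errors='ignore')

# Write the remaining words and their counts, one pair per line.
with open('frequ_words.txt', 'w', encoding='utf-8') as f:
    f.write(filtered.to_string())
```

Both filtered and dropped should hold the same remaining entries; isin avoids the per-word loop and the bare except from the question.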
