Question

Is there a better (faster) way to remove stopwords from a csv file?

Here is the simple code. After more than an hour I am still waiting for the result (so I don't even know whether it is actually working):

import nltk
from nltk.corpus import stopwords
import csv
import codecs

f = codecs.open("agenericcsvfile.csv","r","utf-8")
readit = f.read()
f.close()

filtered = [w for w in readit if not w in stopwords.words('english')]

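The csv file has 50,000 rows, with about 15 million words in total. Why does it take so long? Sadly, this is only a subset: I will have to do this with more than 1 million rows and more than 300 million words. So is there any way to speed things up? Or more elegant code?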

Sample of the CSV file:

1 text,sentiment
2 Loosely based on The Decameron, Jeff Baena's subversive film takes us behind the walls of a 13th century convent and squarely in the midst of a trio of lustful sisters, Alessandra (Alison Brie), Fernanda (Aubrey Plaza), and Ginerva (Kate Micucci) who are "beguiled" by a new handyman, Massetto (Dave Franco). He is posing as a deaf [...] and it is coming undone from all of these farcical complications.,3
3 One might recommend this film to the most liberally-minded of individuals, but even that is questionable as [...] But if you are one of the ribald loving few, who likes their raunchy hi-jinks with a satirical sting, this is your kinda movie. For me, the satire was lost.,5
4 [...]
[...]
50.000 The movie is [...] tht is what I ahve to say.,9

The desired output would be the same csv file without the stopwords.

Answer 1

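The first obvious optimizations are 1/ avoid calling stopwords.words() on each iteration, and 2/ make it a set (set lookup is O(1), whereas list lookup is O(N)):

words = set(stopwords.words("english"))
filtered = [w for w in readit if not w in words]

But this will not yield the expected result, since readit is a string, so you are actually iterating over individual characters, not words. You need to tokenize the string first, as explained here [1]:

from nltk.tokenize import word_tokenize
readit = word_tokenize(readit)
# now readit is a proper list of words...
filtered = [w for w in readit if not w in words]

But now you have lost all the csv newlines, so you cannot properly rebuild the file... And if there is any quoting in your csv, you may run into problems with that too. So in practice you probably want to parse the source properly with a real CSV parser (such as the standard csv module) and clean up the data field by field, row by row, which will of course add some overhead. That is, if your goal is to rebuild the csv without the stopwords (otherwise you may not care that much).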
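The row-by-row cleanup described above might look roughly like the sketch below. It assumes a two-column file (text, sentiment) like the sample in the question, splits words on whitespace instead of calling word_tokenize, and uses a tiny hard-coded stopword set so that it runs without the NLTK corpus. The names remove_stopwords and clean_csv are made up for this illustration; in real use you would pass set(stopwords.words("english")) as the stopword set.

```python
import csv
import io

# Stand-in for set(stopwords.words("english")); kept tiny so the sketch
# runs without downloading the NLTK corpus (assumption for illustration).
STOP = {"the", "is", "a", "to", "of", "and", "this", "for", "was"}


def remove_stopwords(text, stop):
    """Drop whitespace-separated words whose lowercase form is in `stop`."""
    return " ".join(w for w in text.split() if w.lower() not in stop)


def clean_csv(src, dst, stop, text_column=0):
    """Copy CSV rows from `src` to `dst`, cleaning only `text_column`."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))  # copy the header row unchanged
    for row in reader:
        if row:  # csv.reader yields an empty list for blank lines
            row[text_column] = remove_stopwords(row[text_column], stop)
        writer.writerow(row)


# In-memory demo; with real files, open them with newline="" and
# encoding="utf-8" and pass the file objects instead.
source = io.StringIO('text,sentiment\n"For me, the satire was lost.",5\n')
target = io.StringIO()
clean_csv(source, target, STOP)
result = target.getvalue()
```

Because the rows are streamed one at a time, the whole file never has to sit in memory. Note that whitespace splitting leaves punctuation attached to words ("lost." is not the same token as "lost"), so a real tokenizer may remove more stopwords.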

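Anyway: if you have a really huge corpus to clean up and you need performance, the next step is real parallelization: split the source data into parts, send each part to a different process (one per processor/core is a good start), possibly distributed over many machines, and collect the results. This pattern is known as "map reduce", and there are already several Python implementations of it.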
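On a single machine, the split / distribute / collect pattern can be sketched with the standard library's multiprocessing.Pool. This is only an illustration, not a full map-reduce framework: the helper names, the chunk size and the tiny stopword set are assumptions, and the rows are assumed to be already parsed into [text, sentiment] pairs.

```python
from functools import partial
from multiprocessing import Pool

# Tiny stand-in for set(stopwords.words("english")) (assumption).
STOP = frozenset({"the", "is", "a", "to", "of", "and", "this", "was"})


def chunked(rows, size):
    """Yield consecutive slices of `rows` holding at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def clean_chunk(chunk, stop):
    """'Map' step: remove stopwords from the text field of each row."""
    return [
        [" ".join(w for w in text.split() if w.lower() not in stop), label]
        for text, label in chunk
    ]


def clean_parallel(rows, stop, chunk_size=10_000, processes=None):
    """Clean chunks in worker processes and concatenate the results."""
    with Pool(processes) as pool:
        # pool.map returns results in the same order as the input chunks
        parts = pool.map(partial(clean_chunk, stop=stop),
                         chunked(rows, chunk_size))
    return [row for part in parts for row in part]
```

Call clean_parallel(rows, STOP) from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start workers by spawning a fresh interpreter. Whether this actually beats the single-process version depends on the data, since every chunk has to be pickled to and from the workers.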

Answer 2

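It seems that the stopwords returned by NLTK are a list, which therefore has O(n) lookup. Convert the list to a set first and it will be much faster.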

>>> some_word = "aren't"
>>> stop = stopwords.words('english')
>>> type(stop)
list
>>> %timeit some_word in stop
1000000 loops, best of 3: 1.3 µs per loop

>>> stop = set(stopwords.words('english'))
>>> %timeit some_word in stop
10000000 loops, best of 3: 43.8 ns per loop

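However, while this should fix the performance problem, your code does not seem to do what you expect in the first place. readit is a single string holding the content of the entire file, so you are iterating over characters, not words. You import the csv module but never use it. Also, the strings in your csv file should be quoted, otherwise the csv module will split at every comma, not just at the last one. If you cannot change the csv file, it may be easier to use str.rsplit.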

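# split each line once from the right, keeping everything before the last comma
texts = [line.rsplit(",", 1)[0] for line in readit.splitlines()]
# stop is the set built above: stop = set(stopwords.words('english'))
filtered = [[w for w in text.split() if w.lower() not in stop]
            for text in texts]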
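If the goal is to write the file back out, the sentiment column has to be kept next to the text. Below is a minimal sketch building on the rsplit idea; the hard-coded stopword set and the two input lines are made up for illustration and stand in for set(stopwords.words('english')) and the real file content.

```python
# Stand-in for set(stopwords.words("english")) (assumption).
stop = {"the", "is", "a", "to", "of", "and", "this", "for", "was", "but"}

# Unquoted lines like those in the question: commas inside the text,
# with the sentiment after the last comma (made-up sample data).
lines = [
    "For me, the satire was lost.,5",
    "The movie is long, but this is fine.,9",
]

cleaned = []
for line in lines:
    # Split once from the right so commas inside the text are preserved.
    text, sentiment = line.rsplit(",", 1)
    kept = [w for w in text.split() if w.lower() not in stop]
    cleaned.append(" ".join(kept) + "," + sentiment)
```

Each entry of cleaned is a line in the same unquoted layout as the input, so the list can be joined with newlines and written back to a file.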

Speed up removing stopwords from a huge csv file
