Question

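I'm having trouble removing stop words from and tokenizing a .text file using nltk. I keep getting the following error message: AttributeError: 'list' object has no attribute 'lower'. I just can't figure out what I'm doing wrong, although this is my first time doing something like this. My code is below. I'd appreciate any advice, thanks.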

Import nltk
from nltk.corpus import stopwords
s = open("C:\zircon\sinbo1.txt").read()
tokens = nltk.word_tokenize(s)
def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(s)
    cleanup = [token.lower()for token in tokens.lower() not in stopset and  len(token)>2]
    return cleanup
cleanupDoc(s)

Answer 1

You can use the stopwords list from NLTK; see How to remove stop words using nltk or python.

Most likely you will also want to strip punctuation. You can use string.punctuation for that, see http://docs.python.org/2/library/string.html:

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a foo bar, bar black sheep."
>>> stop = stopwords.words('english') + list(string.punctuation)
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['foo', 'bar', 'bar', 'black', 'sheep']

Answer 2

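From the error message, it looks like you are trying to convert a list, not a string, to lowercase. Your tokens = nltk.word_tokenize(s) is probably not returning what you expect (which seems to be a string).

It would be helpful to know the format of your sinbo.txt file.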

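A few syntax issues:

The import should be lowercase: import nltk
The line s = open("C:\zircon\sinbo1.txt").read() reads the whole file at once, not one line at a time. This may be a problem because word_tokenize works on a single sentence, not on any sequence of tokens. The current line assumes that your sinbo.txt file contains a single sentence. If it doesn't, you may want to either (a) use a for loop over the file instead of read() or (b) use punct_tokenizer on a whole batch of sentences split on punctuation.

The first line of your cleanupDoc function is not indented correctly. Your function should look like this (even if the functions inside it change).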

import nltk
from nltk.corpus import stopwords
def cleanupDoc(s):
    stopset = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(s)
    cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    return cleanup
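
Option (a) above, looping over the file instead of calling read(), could be sketched roughly as follows. This is only a minimal sketch under some assumptions: cleanup_lines is a hypothetical helper name, the tokenizer is passed in as a parameter so that nltk.word_tokenize can be supplied, and the regex tokenizer and the tiny stopword set are stand-ins so the example runs without downloading any NLTK data.

```python
import io
import re


def cleanup_lines(lines, tokenize, stopset, min_len=3):
    """Tokenize an iterable of lines one at a time and filter the tokens.

    `tokenize` is any callable that maps a string to a list of tokens,
    e.g. nltk.word_tokenize (assumption: NLTK and its data are installed).
    """
    cleaned = []
    for line in lines:
        for token in tokenize(line):
            word = token.lower()
            # Keep tokens that are not stop words and have at least min_len characters.
            if word not in stopset and len(word) >= min_len:
                cleaned.append(word)
    return cleaned


def simple_tokenize(text):
    # Stand-in tokenizer for illustration only: runs of letters, digits, apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text)


# Stand-in stopword set; with NLTK this would be set(stopwords.words('english')).
demo_stopset = {"this", "is", "a", "the", "and"}

# io.StringIO stands in for an open file object; a real file iterates the same way.
fake_file = io.StringIO("This is a foo bar.\nThe black sheep and the zircon.\n")
result = cleanup_lines(fake_file, simple_tokenize, demo_stopset)
print(result)
```

With NLTK available, the same helper could presumably be called with an open file object, nltk.word_tokenize, and set(stopwords.words('english')) in place of the stand-ins.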

Answer 3

import nltk
from nltk.corpus import stopwords

def cleanupDoc(s):
    stopset = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(s)
    # Drop stop words (compared in lowercase) and join the remaining tokens into one string
    cleanup = " ".join(filter(lambda word: word.lower() not in stopset, tokens))
    return cleanup

s = "I am going to disco and bar tonight"
x = cleanupDoc(s)
print(x)

This code may help with the problem above.

Answer 4

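In your particular case, the error is in cleanup = [token.lower()for token in tokens.lower() not in stopset and len(token)>2]

tokens is a list, so you cannot call tokens.lower() on it; lower() only exists on the individual strings. The comprehension is also missing its if keyword. The line can instead be written as

cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]

I hope this helps.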
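
To see the difference in isolation, here is a small sketch that needs no NLTK data. The token list and the stopword set are made-up stand-ins for what nltk.word_tokenize and stopwords.words('english') would return, so treat the values as assumptions.

```python
# Stand-ins for illustration (assumptions, not real NLTK output).
tokens = ["The", "Zircon", "is", "in", "the", "Rock"]
stopset = {"the", "is", "in"}

# A list has no .lower() method, so this reproduces the reported AttributeError.
try:
    tokens.lower()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# .lower() has to be called on each string inside the comprehension instead.
cleanup = [token.lower() for token in tokens
           if token.lower() not in stopset and len(token) > 2]
print(cleanup)
```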

Getting rid of stop words and document tokenization using NLTK
