Question

I am importing a CSV file with Pandas; its contents look like this:

test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]

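I want to check whether any word from the stop list appears in the test set defined above and, if so, remove it. However, when I try to do this, I just get the full list back with no changes. Here is my current code: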

df = pd.read_csv('test.csv', delimiter=',')
tlist = [tuple(x) for x in df.values]
tlist = [(x.lower(), y.lower()) for x,y in tlist]

def remove_stopwords(train_list):
        new_list = []
        for word in train_list:
            if word not in stopwords.words('english'):
                new_list.append(word)
        print new_list

remove_stopwords(tlist)

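I am trying to use the stopwords provided by the NLTK corpus. As I said, when I test this code with print(new_list), all that happens is that I get tlist back exactly as it was.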

Answer 1

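@Vardan's point is absolutely right: you need two loops, one over the tuples and one over the actual sentence. However, rather than working on the raw string (character by character), we can convert the string into tokens and check each token against the stopwords.

The following code should work: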

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd

# Note: the NLTK stopwords corpus and tokenizer data must be downloaded first.
df = pd.read_csv('test.csv', delimiter=',')
tlist = [tuple(x) for x in df.values]
tlist = [(x.lower(), y.lower()) for x, y in tlist]

stop_words = set(stopwords.words('english'))  # build the stop list once

def remove_stopwords(train_list):
    new_list = []
    for sentence, label in train_list:
        # convert the first string in the tuple into tokens
        tokens = word_tokenize(sentence)
        # check each token against the stopwords and keep the rest
        kept = [tok for tok in tokens if tok not in stop_words]
        # store the filtered sentence along with its pos/neg label
        new_list.append((' '.join(kept), label))
    return new_list

print(remove_stopwords(tlist))
print(tlist)  # the original list is left unchanged

Answer 2

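The word in the for loop is actually a tuple, because tlist has the form [(a1, b1), (a2, b2)] (a list of tuples). Each whole tuple is then compared against the individual words in the stop list, so nothing ever matches. You can see this if you print it: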

def remove_stopwords(train_list):
    new_list = []
    for word in train_list:
        print(word)  # prints a whole tuple, e.g. ('the beer was good.', 'pos')
        if word not in stopwords.words('english'):
            new_list.append(word)
    print(new_list)
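
The mismatch can also be reproduced without NLTK. In this minimal sketch, SAMPLE_STOPWORDS is a small hand-written stand-in for stopwords.words('english') (an assumption, made only so the snippet runs on its own). It shows that a whole tuple is never found in a list of strings, while an individual word can be:

```python
# Hypothetical stand-in for stopwords.words('english'), kept tiny for illustration.
SAMPLE_STOPWORDS = ['the', 'was', 'i', 'do', 'not', 'my']

tlist = [('the beer was good.', 'pos'), ('i do not enjoy my job', 'neg')]

# Each element of tlist is a (sentence, label) tuple, not a word,
# so the membership test is always False.
tuple_matches = [item in SAMPLE_STOPWORDS for item in tlist]

# Splitting the sentence yields strings that can actually match.
word_matches = [w in SAMPLE_STOPWORDS for w in tlist[0][0].split()]

print(tuple_matches)  # [False, False]
print(word_matches)   # [True, False, True, False]
```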

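If you want to remove those words, you need at least two loops: one to iterate over the list and one to iterate over the words. Something like this will work: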

def remove_stopwords(train_list):
    stop_words = set(stopwords.words('english'))
    new_list = []
    for tl in train_list:
        # tl is a tuple such as ('the beer was good.', 'pos')
        words = tl[0].split()
        for word in words:  # word will be: the, beer, was, good.
            if word not in stop_words:
                new_list.append(word)
    print(new_list)
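
The loop above collects every remaining word into one flat list, so the pos/neg labels are lost and a token such as good. keeps its trailing period. A possible variant, sketched here with a small hand-written stop list (SAMPLE_STOPWORDS and filter_sentences are illustrative names, not part of NLTK), keeps one tuple per sentence and strips surrounding punctuation before the comparison:

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')).
SAMPLE_STOPWORDS = {'the', 'was', 'i', 'do', 'not', 'my', 'is', 'a', 'of'}

def filter_sentences(train_list, stop_words=SAMPLE_STOPWORDS):
    """Return (filtered sentence, label) tuples with stop words removed."""
    result = []
    for sentence, label in train_list:
        kept = []
        for word in sentence.split():
            # Compare without leading/trailing punctuation, e.g. 'good.' -> 'good'.
            bare = word.strip(string.punctuation)
            if bare and bare not in stop_words:
                kept.append(bare)
        result.append((' '.join(kept), label))
    return result

sample = [('the beer was good.', 'pos'), ('i do not enjoy my job', 'neg')]
print(filter_sentences(sample))
# [('beer good', 'pos'), ('enjoy job', 'neg')]
```

With the real NLTK list, passing stop_words=set(stopwords.words('english')) should slot in the same way.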

Answer 3

Try this:

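def remove_stopwords(train_list):
    stop_words = set(stopwords.words('english'))
    new_list = []
    for line, gr in train_list:
        # keep only the words of the sentence that are not stopwords
        kept = [word for word in line.split() if word not in stop_words]
        new_list.append((' '.join(kept), gr))
    return new_list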

Or like this:

def remove_stopwords(train_list):
    stop_words = set(stopwords.words('english'))
    new_list = []
    for line, gr in train_list:
        line = " %s " % line  # pad so the first and last words can match too
        for word in line.split():
            if word in stop_words:
                line = line.replace(" %s " % word, ' ')
        new_list.append((line.strip(), gr))
    return new_list
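
The replace-based version depends on a space sitting on both sides of each stop word. One alternative, shown as a rough sketch with a hand-written stop list (SAMPLE_STOPWORDS, STOP_RE and strip_stopwords are hypothetical names), is a single regular expression that matches whole whitespace-delimited tokens, so words at the start or end of a sentence are covered as well:

```python
import re

# Hypothetical stand-in for set(stopwords.words('english')).
SAMPLE_STOPWORDS = {'the', 'was', 'i', 'do', 'not', 'my'}

# (?<!\S) and (?!\S) require whitespace or a string edge around the token,
# so 'i' does not match inside 'is' or "i'm".
STOP_RE = re.compile(
    r'(?<!\S)(?:' + '|'.join(map(re.escape, sorted(SAMPLE_STOPWORDS))) + r')(?!\S)'
)

def strip_stopwords(train_list):
    """Return (sentence, label) tuples with stop-word tokens removed."""
    result = []
    for sentence, label in train_list:
        cleaned = STOP_RE.sub('', sentence)
        # Collapse the gaps left behind by the removed tokens.
        result.append((' '.join(cleaned.split()), label))
    return result

sample = [('the beer was good.', 'pos'), ('i do not enjoy my job', 'neg')]
print(strip_stopwords(sample))
# [('beer good.', 'pos'), ('enjoy job', 'neg')]
```

Punctuation is left attached to the surviving words here; whether that is acceptable depends on how the sentences are used afterwards.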

How to check whether a tuple contains a word and, if so, remove it
