Exclude words that appear in a list from a string

Time: 2015-06-28 15:19:10

Tags: python regex

I have a list like this:

stopwords = ['a', 'and', 'is']

and a sentence like this:

sentence = 'A Mule is Eating and drinking.'

Expected output:

reduced = ['mule', 'eating', 'drinking']

What I have so far:

reduced = filter(None, re.match(r'\W+', sentence.lower()))

Now how would you filter out the stopwords (note the conversion from upper to lower case and the removal of punctuation)?

7 answers:

Answer 0: (score: 2)

The filter expression is wrong. Change it to:

>>> import re
>>> reduced = filter(lambda w: w and w not in stopwords, re.split(r'\W+', sentence.lower()))

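The first argument is the filter criterion. Also note that to split the sentence you need re.split, not re.match. Because the sentence ends with punctuation, re.split(r'\W+', ...) also yields a trailing empty string, so empty strings have to be filtered out as well.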

>>> list(reduced)
['mule', 'eating', 'drinking']
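
For comparison, here is a minimal sketch of what re.match and re.split each return on the sample sentence. The names match_result and tokens are only for illustration and are not part of the answer above.

```python
import re

stopwords = ['a', 'and', 'is']
sentence = 'A Mule is Eating and drinking.'

# re.match only tries the pattern at the start of the string. The sentence
# starts with a word character, so \W+ does not match and the result is None.
match_result = re.match(r'\W+', sentence.lower())

# re.split cuts the string at every run of non-word characters. The final
# '.' leaves an empty string at the end of the list.
tokens = re.split(r'\W+', sentence.lower())

# Drop empty strings and stopwords in one pass.
reduced = [w for w in tokens if w and w not in stopwords]

print(match_result)
print(tokens)
print(reduced)
```

Since re.match returns None here, passing its result to filter raises a TypeError, which is why the original attempt fails.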

Answer 1: (score: 1)

You don't need a regex to filter out stopwords. One way is to split your string and rebuild it without the strings that are in the list:

lst = sentence.lower().split()  # lowercase first so that 'A' matches the stopword 'a'
' '.join([w for w in lst if w not in stopwords])

Regular expressions are useful when you have a repeating pattern, not when you are looking for an exact match.
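
A regex-free variant that also handles the punctuation requirement from the question might look like the sketch below. It assumes Python 3 (for the three-argument str.maketrans), and the names cleaned, kept and rebuilt are illustrative only.

```python
from string import punctuation

stopwords = ['a', 'and', 'is']
sentence = 'A Mule is Eating and drinking.'

# Lowercase the sentence, then delete every ASCII punctuation character.
# Note that this also removes apostrophes inside words ("mule's" -> "mules").
cleaned = sentence.lower().translate(str.maketrans('', '', punctuation))

# Split on whitespace and keep only the words that are not stopwords.
kept = [w for w in cleaned.split() if w not in stopwords]

# Rebuild the string, as in the answer above.
rebuilt = ' '.join(kept)

print(kept)
print(rebuilt)
```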

Answer 2: (score: 1)

If you are not going the regex route, you can use a list comprehension with string.split() and check that each string is not in stopwords.

Example:

>>> stopwords = ['a', 'and', 'is']
>>> sentence = 'a mule is eating and drinking'
>>> reduced = [s.lower() for s in sentence.split() if s.lower() not in stopwords]
>>> reduced
['mule', 'eating', 'drinking']

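As a performance benefit, you can also convert the stopwords list to a set with set() and do the lookups there, since a membership test on a set is O(1) on average.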
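
The set-based lookup could be sketched as follows. The name stopword_set is an illustrative choice, not something from the answer.

```python
stopwords = ['a', 'and', 'is']
sentence = 'a mule is eating and drinking'

# Build the set once, outside the comprehension. A membership test on a set
# is O(1) on average, versus O(n) for a list.
stopword_set = set(stopwords)

# Lowercase each word once, then test it against the set.
reduced = [w for w in (s.lower() for s in sentence.split())
           if w not in stopword_set]

print(reduced)
```

With three stopwords the difference is negligible; it starts to matter with lists of a few hundred words and long texts.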

Answer 3: (score: 1)

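If you are working with text material, it is worth noting that NLTK (Natural Language Toolkit) is a "framework" for analyzing text. Not only does it have many of the built-in functions you need when working with text, the NLTK Book is also a tutorial for learning Python and text analysis at the same time. How cool is that!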

For example,

from nltk.corpus import stopwords
stopwords.words('english')

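gives a list of 127 English stopwords (the exact count depends on the installed NLTK data). The first few in that list are: i, me, my, myself, and we. Note that these words are lowercase.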

So the problem posed above, handled the NLTK way, would look like:

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

raw = ('All human rules are more or less idiotic, I suppose. '
       'It is best so, no doubt. The way it is now, the asylums can '
       'hold the sane people, but if we tried to shut up the insane '
       'we should run out of building materials. -- Mark Twain')

tokenizer = RegexpTokenizer(r'\w+')
words  = tokenizer.tokenize(raw)

sw = stopwords.words('english')

reduced = [w.lower() for w in words if w.lower() not in sw]

The line:

tokenizer = RegexpTokenizer(r'\w+')

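is the regular expression the tokenizer uses to strip out punctuation. Often the exact word form matters. For example, "human" is the stem of "humans", and an analysis may center on the noun "human" rather than on its various forms. If such details need to be preserved, the regular expression can be refined. Building robust regular expressions certainly takes time, but practice makes perfect.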

If you don't mind the overhead of learning NLTK, for instance because you do text analysis regularly, this might be worth considering.
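
To try the same pipeline without installing NLTK, the sketch below approximates the tokenizer with re.findall and the same \w+ pattern. The small sw set is a hypothetical stand-in for stopwords.words('english'), not the real NLTK list, and the text is shortened to the first two sentences of the quote.

```python
import re

raw = ('All human rules are more or less idiotic, I suppose. '
       'It is best so, no doubt.')

# Hypothetical stand-in for stopwords.words('english'); the real NLTK list
# is much longer.
sw = {'all', 'are', 'more', 'or', 'i', 'it', 'is', 'so', 'no'}

# Same pattern as the tokenizer above: keep runs of word characters only,
# which drops the commas and periods.
words = re.findall(r'\w+', raw)

# Lowercase each token and keep it only if it is not a stopword.
reduced = [w.lower() for w in words if w.lower() not in sw]

print(reduced)
```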

Answer 4: (score: 0)

You can strip the punctuation:

from string import punctuation
stopwords = set(['a', 'and', 'is'])

sentence = 'A Mule is Eating and drinking.'

print([word.strip(punctuation) for word in sentence.lower().split() if word.strip(punctuation) not in stopwords])
['mule', 'eating', 'drinking']

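Splitting on a regex is the wrong approach, because it breaks a single word such as "Foo's" into "foo" and "s". If you do intend to use a regex, don't use re.split; use findall with filter, so that you don't have to filter out empty strings for no reason: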

stopwords = set(['a', 'and', 'is'])

reduced = filter(lambda w: w not in stopwords, re.findall(r"\w+", sentence.lower()))
print(list(reduced))
['mule', 'eating', 'drinking']

To keep a word such as Mule"s as a single word with a regex:

sentence = 'A Mule"s  Eating and drinking.'
reduced = filter(lambda w: w not in stopwords, re.findall(r"\w+\S\w+|\w+", sentence.lower()))
print(list(reduced))
['mule"s', 'eating', 'drinking']

Your own regex and the accepted answer would split that word into two parts, which I doubt is what you actually want:

In [7]: sentence = 'A Mule"s Eating and drinking.'
In [8]: reduced = filter(lambda w: w not in stopwords, re.split(r'\W+', sentence.lower()))
In [9]: reduced
Out[9]: ['mule', 's', 'eating', 'drinking', '']
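
If the apostrophe case is the only one that matters, a narrower pattern is another option. The sketch below uses r"\w+(?:'\w+)*", which is an assumption of this example and not taken from the answer; it keeps apostrophes that sit between word characters and nothing else.

```python
import re

stopwords = {'a', 'and', 'is'}
sentence = "A Mule's Eating and drinking."

# \w+ matches a word; (?:'\w+)* optionally extends it across apostrophes,
# so "mule's" stays in one piece while the final '.' is still dropped.
tokens = re.findall(r"\w+(?:'\w+)*", sentence.lower())

# findall never returns empty strings here, so only stopwords are filtered.
reduced = [w for w in tokens if w not in stopwords]

print(reduced)
```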

Answer 5: (score: 0)

This code removes the stopwords. It works in PySpark.

stopwordsT=["a","about","above","above","across","after","afterwards","again","against","all","almost","alone","along","already","also","although","always","am","among", "amongst", "amoungst","amount", "an","and","another","any","anyhow","anyone","anything","anyway","anywhere","are","around","as","at","back","be","became","because","become","becomes","becoming","been","before","beforehand","behind","being","below","beside","besides","between","beyond","bill","both","bottom","but","by","call","can","cannot","cant","co","con","could","couldnt","cry","de","describe","detail","do","done","down","due","during","each","eg","eight","either","eleven","else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own","part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", 
"somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the"]
sentence = "about esto alone es a una already pruba across para after ver too si top funciona"
lst = sentence.split()
' '.join([w for w in lst if w not in stopwordsT])

Answer 6: (score: -1)

<h3 id="assignments">Assignments</h3>
    <div id="section">
      Assignment Count <input type="number" name="assignment_count">
    </div>