Formatting a regular expression in Python

Time: 2014-01-08 10:50:18

Tags: python regex

I have a list of words:

wordlist = ['hypothesis', 'test', 'results', 'total']

and a sentence:

sentence = "These tests will benefit in the long run."

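I want to check whether any of the words in wordlist appear in the sentence. I know you can check whether they are substrings of the sentence like this: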

for word in wordlist:
    if word in sentence:
        print(word)

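With substrings, however, I start matching words that are not in wordlist. Here, for example, test shows up as a substring of the sentence even though the sentence actually contains tests. I could solve this with a regular expression, but is it possible to build the regex so that each new word gets formatted into it? That is, if I want to check whether a word is in the sentence: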

for some_word_goes_in_here in wordlist:
    if re.search('.*(some_word_goes_in_here).*', sentence):
        print(some_word_goes_in_here)

In this case the regex treats the literal text some_word_goes_in_here as the pattern to search for, not the value of some_word_goes_in_here. Is there a way to format the input so that the regex searches for the value of some_word_goes_in_here?

3 Answers:

Answer 0: (score: 2)

Use \b word boundaries to match whole words:

import re

for word in wordlist:
    # re.escape keeps any regex metacharacters in `word` literal
    if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
        print('{} matched'.format(word))
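
To make the effect of \b concrete, here is a small sketch that checks one fixed pattern against a few made-up sample strings (the samples are assumptions chosen for illustration, not taken from the question). It shows that a boundary only exists where a word character meets a non-word character, and that digits and underscores count as word characters.

```python
import re

# One fixed whole-word pattern for the word "test".
pattern = re.compile(r'\btest\b')

# Hypothetical sample strings, chosen to probe the boundary rules.
samples = ['a test.', 'tests', 'test_case', 'test2', 'pre-test run']

# Map each sample to whether the pattern is found anywhere in it.
results = {text: bool(pattern.search(text)) for text in samples}

for text in samples:
    print('{!r}: {}'.format(text, results[text]))
```

Only 'a test.' and 'pre-test run' match, because punctuation, hyphens and spaces are non-word characters, while 's', '_' and '2' are word characters that leave no boundary after test.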

You can also split the sentence into separate words. Turning the word list into a set makes the test more efficient:

words = set(wordlist)
if words.intersection(sentence.split()):
    # no looping over `words` required.
    print('at least one word matched')

Demo:

>>> import re
>>> wordlist = ['hypothesis' , 'test' , 'results' , 'total']
>>> sentence = "These tests will benefit in the long run."
>>> for word in wordlist:
...     if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
...         print('{} matched'.format(word))
... 
>>> words = set(wordlist)
>>> words.intersection(sentence.split())
set([])
>>> sentence = 'Lets test this hypothesis that the results total the outcome'
>>> for word in wordlist:
...     if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
...         print('{} matched'.format(word))
... 
hypothesis matched
test matched
results matched
total matched
>>> words.intersection(sentence.split())
set(['test', 'total', 'hypothesis', 'results'])
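
One caveat with the split-based approach: sentence.split() only splits on whitespace, so a word followed by punctuation keeps that punctuation attached. A possible workaround, sketched here with a made-up sentence, is to extract runs of word characters with re.findall before intersecting. The sentence and the variable names naive, tokens and cleaned are assumptions for illustration.

```python
import re

wordlist = ['hypothesis', 'test', 'results', 'total']
# Hypothetical sentence that ends with a listed word plus a period.
sentence = "The hypothesis held up in every test."

# Plain split() leaves 'test.' with its trailing period attached,
# so 'test' is missed.
naive = set(wordlist).intersection(sentence.split())

# Extracting runs of word characters drops the punctuation first.
tokens = re.findall(r'\w+', sentence)
cleaned = set(wordlist).intersection(tokens)

print(sorted(naive))
print(sorted(cleaned))
```

Both variants remain case-sensitive. Lowercasing the sentence first would be a further assumption about what should count as a match.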

Answer 1: (score: 1)

Try using:

if re.search(r'\b' + word + r'\b', sentence):

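\b is a word boundary: it matches at the position between a word character and a non-word character (a word character is any letter, digit, or underscore), as well as at the start or end of the string next to a word character.

For example: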

>>> import re
>>> wordlist = ['hypothesis' , 'test' , 'results' , 'total']
>>> sentence = "The total results for the test confirm the hypothesis"
>>> for word in wordlist:
...     if re.search(r'\b' + word + r'\b', sentence):
...         print(word)
...
hypothesis
test
results
total

With your string:

>>> sentence = "These tests will benefit in the long run."
>>> for word in wordlist:
...     if re.search(r'\b' + word + r'\b', sentence):
...         print(word)
...
>>>

Nothing is printed.
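
Concatenating word straight into the pattern works for plain alphabetic words like the ones in the question. If a word could contain regex metacharacters, it would be interpreted as pattern syntax. The sketch below shows the difference using re.escape. The helper name contains_word and the sample strings are hypothetical, introduced only for this illustration.

```python
import re

def contains_word(word, text):
    """Return True if `word` occurs in `text` as a whole word.

    Hypothetical helper: escapes `word` so it is matched literally.
    """
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text) is not None

# Without escaping, '.' means "any character", so '3.14' also matches '3x14'.
unescaped = re.search(r'\b' + '3.14' + r'\b', 'value 3x14 here')
escaped = contains_word('3.14', 'value 3x14 here')

print(unescaped is not None)
print(escaped)
print(contains_word('3.14', 'pi is 3.14 roughly'))
```

The unescaped pattern finds a spurious match in '3x14', while the escaped version only matches the literal text 3.14.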

Answer 2: (score: 1)

I would use this:

import re

words = "hypothesis test results total".split()
# ^^^ but you can use your literal list if you prefer that
for word in words:
    if re.search(r'\b%s\b' % (word,), sentence):
        print(word)

You can even speed this up by using a single regular expression:

for found_word in re.findall(r'\b' + r'\b|\b'.join(words) + r'\b', sentence):
    print(found_word)
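
A variation on the single-regex idea, offered as a sketch rather than as this answer's own method: wrap the alternatives in a non-capturing group so one pair of \b anchors covers all of them, escape each word, and compile the pattern once. The sentence is the second one from the demo in Answer 0. The names pattern and found are assumptions.

```python
import re

wordlist = ['hypothesis', 'test', 'results', 'total']
sentence = 'Lets test this hypothesis that the results total the outcome'

# Build one alternation; the non-capturing group (?:...) lets a single
# pair of \b anchors apply to every alternative.
pattern = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, wordlist))))

# With no capturing groups, findall returns the full matches in the
# order they occur in the sentence.
found = pattern.findall(sentence)
print(found)

# The sentence from the question contains only 'tests', so nothing matches.
print(pattern.findall("These tests will benefit in the long run."))
```

Compiling once avoids rebuilding the pattern for every word, and the matches come back in sentence order rather than wordlist order.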