Question

This is related to the following question: Searching for Unicode characters in Python

I have a string like this:

sentence = 'AASFG BBBSDC FEKGG SDFGF'

I split it and get a list of words like this:

sentence = ['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']

I use the following code to search for part of a word and get the whole word:

[word for word in sentence.split() if word.endswith("GG")]

This returns ['FEKGG'].

Now I need to find what is behind and in front of that word.

For example, when I search for "GG" it returns ['FEKGG']. It should also give me

behind = 'BBBSDC'
infront = 'SDFGF'

Answer 1

Using this generator:

If you have the following string (edited from the original):

sentence = 'AASFG BBBSDC FEKGG SDFGF KETGG'

def neighborhood(iterable):
    # Yield (previous, current, next) triples; None marks a missing neighbour.
    iterator = iter(iterable)
    prev = None
    try:
        item = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for nxt in iterator:
        yield (prev, item, nxt)
        prev = item
        item = nxt
    yield (prev, item, None)

matches = [word for word in sentence.split() if word.endswith("GG")]
results = []

for prev, item, nxt in neighborhood(sentence.split()):
    for match in matches:
        if match == item:
            results.append((prev, item, nxt))

Returns:

[('BBBSDC', 'FEKGG', 'SDFGF'), ('SDFGF', 'KETGG', None)]

Answer 2

Here is one possibility:

words = sentence.split()
[pos] = [i for (i, word) in enumerate(words) if word.endswith("GG") ]
behind = words[pos - 1]
infront = words[pos + 1]

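You may need to watch out for edge cases, such as "…GG" not occurring, occurring more than once, or being the first and/or last word. As it stands, most of these raise an exception, which may well be the correct behavior. The one case that does not is a match on the first word: words[pos - 1] is then words[-1], which silently returns the last word.

A completely different solution using regular expressions avoids splitting the string into an array in the first place: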
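
The edge cases described above can also be handled explicitly instead of relying on exceptions. The following is a minimal sketch rather than part of the original answer; the function name `neighbors_of_matches` is hypothetical. It assumes you want one (behind, match, infront) tuple per match, with None where a neighbour does not exist and an empty list when nothing matches.

```python
def neighbors_of_matches(sentence, suffix):
    """Return (behind, match, infront) for every word ending with suffix."""
    words = sentence.split()
    results = []
    for pos, word in enumerate(words):
        if not word.endswith(suffix):
            continue
        # Guard the indices explicitly: words[-1] would wrap around to the
        # last word, and words[len(words)] would raise IndexError.
        behind = words[pos - 1] if pos > 0 else None
        infront = words[pos + 1] if pos + 1 < len(words) else None
        results.append((behind, word, infront))
    return results


print(neighbors_of_matches('AASFG BBBSDC FEKGG SDFGF', 'GG'))
# [('BBBSDC', 'FEKGG', 'SDFGF')]
```

Whether a missing neighbour should be None or an error depends on the caller; returning None keeps the loop usable when there are several matches.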

import re

match = re.search(r'\b(\w+)\s+(?:\w+GG)\s+(\w+)\b', sentence)
(behind, infront) = match.groups()

Answer 3

Here is one way. If the "GG" word is at the beginning or end of the sentence, the preceding or following element will be None.

words = sentence.split()
[(infront, word, behind) for (infront, word, behind) in 
 zip([None] + words[:-1], words, words[1:] + [None])
 if word.endswith("GG")]

Answer 4

sentence = 'AASFG BBBSDC FEKGG SDFGF AAABGG FOOO EEEGG'

def make_trigrams(l):
    l = [None] + l + [None]

    for i in range(len(l)-2):
        yield (l[i], l[i+1], l[i+2])


for result in [t for t in make_trigrams(sentence.split()) if t[1].endswith('GG')]:
    behind,match,infront = result

    print('Behind:', behind)
    print('Match:', match)
    print('Infront:', infront, '\n')

Output:

Behind: BBBSDC
Match: FEKGG
Infront: SDFGF

Behind: SDFGF
Match: AAABGG
Infront: FOOO

Behind: FOOO
Match: EEEGG
Infront: None

Answer 5

Another itertools-based option, which may be more memory-friendly for large data sets:

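from itertools import tee

def sentence_targets(sentence, endstring):
    before, target, after = tee(sentence.split(), 3)
    # Offset the iterators so they point at three consecutive words.
    next(target, None)
    next(after, None)
    next(after, None)
    # zip stops at the shortest iterator, so the first and last words
    # are never yielded as targets.
    for trigram in zip(before, target, after):
        if trigram[1].endswith(endstring):
            yield trigram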

Edit: fixed a typo
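
Since sentence.split() builds the full word list up front, a lazier variant is possible. The following is a sketch under stated assumptions, not the answer author's code: the name `lazy_sentence_targets` is hypothetical, words are taken to be runs of non-whitespace characters, and, unlike the function above, the first and last words are padded with None so that they can be reported as matches too.

```python
import re
from collections import deque


def lazy_sentence_targets(sentence, endstring):
    """Lazily yield (before, target, after) for words ending with endstring."""
    # re.finditer produces words one at a time instead of building a list.
    words = (m.group() for m in re.finditer(r'\S+', sentence))
    # Sliding window of three items; the leading None pads the first word.
    window = deque([None], maxlen=3)
    for word in words:
        window.append(word)
        if len(window) == 3 and window[1].endswith(endstring):
            yield tuple(window)
    # A trailing None pads the last word so it can be a target as well.
    window.append(None)
    if len(window) == 3 and window[1].endswith(endstring):
        yield tuple(window)


sentence = 'AASFG BBBSDC FEKGG SDFGF AAABGG FOOO EEEGG'
print(list(lazy_sentence_targets(sentence, 'GG')))
# [('BBBSDC', 'FEKGG', 'SDFGF'), ('SDFGF', 'AAABGG', 'FOOO'), ('FOOO', 'EEEGG', None)]
```

The sentence string itself still lives in memory; the saving is only in never holding all the words at once, which matters mainly for very long inputs.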

Finding the words in front of and behind a word in a Python list
