Question

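I have various strings that are guaranteed to contain myWord (in some cases several times; only the first occurrence should be handled), but the strings differ in length. Some contain hundreds of words, others only a few.

I would like to find a way to get a snippet from the text. The rule is: the snippet should contain myWord plus X words before and after it.

Something like this: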

rawText= "This is an example lorem ipsum sentence for a Stackoverflow question."

myWord = "sentence"

Suppose I want the word "sentence" plus and minus 3 words, for example:

"example lorem ipsum sentence for a Stackoverflow"

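I have a working solution, but it cuts the snippet by a number of characters rather than by the number of words before/after myWord. So my question is: is there a more suitable solution, perhaps a built-in Python function, that achieves this?

The solution I currently use: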

myWord = "mollis"
rawText = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse sit amet arcu vulputate, sodales arcu non, finibus odio. Aliquam sed tincidunt nisi, eu scelerisque lectus. Curabitur in nibh enim. Duis arcu ante, mollis sed iaculis non, hendrerit ut odio. Curabitur gravida condimentum posuere. Sed et arcu finibus felis auctor mollis et id risus. Nam urna tellus, ultricies a aliquam at, euismod et erat. Cras pretium venenatis ornare. Donec pulvinar dui eu dui facilisis commodo. Vivamus eget ultrices turpis, vel egestas lacus."

# The index where the word is located
wordIndexNumber = rawText.lower().find(myWord.lower())

# The total length of the text (in chars)
textLength = len(rawText)

textPart2 = len(rawText)-wordIndexNumber

if wordIndexNumber < 80:
    textIndex1 = 0
else:
    textIndex1 = wordIndexNumber - 80

if textPart2 < 80:
    textIndex2 = textLength
else:
    textIndex2 = wordIndexNumber + 80

snippet = rawText[textIndex1:textIndex2]

print (snippet)

Answer 1

Here is one approach: split the string into words and slice the resulting list.

Demo:

rawText= "This is an example lorem ipsum sentence for a Stackoverflow question."
myWord = "sentence"
rawTextList = rawText.split()
# Look up the first occurrence once; clamp the start so it never goes negative
idx = rawTextList.index(myWord)
frontVal = " ".join(rawTextList[max(0, idx - 3):idx])
backVal = " ".join(rawTextList[idx:idx + 4])

# strip() drops the leading space when myWord is the very first word
print("{} {}".format(frontVal, backVal).strip())

Output:

example lorem ipsum sentence for a Stackoverflow
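
One limitation of matching with `list.index` on the raw split is that a token like "sentence," or "Sentence" does not equal "sentence". A minimal sketch of a more tolerant variant follows, assuming that lowercasing and stripping surrounding punctuation is an acceptable way to compare words; the function name `snippet_around` and its parameter `n` are hypothetical, not part of the original answer.

```python
import string


def snippet_around(text, word, n=3):
    """Return up to n words on each side of the first occurrence of word."""
    tokens = text.split()
    # Compare on a normalized form: lowercase, surrounding punctuation stripped.
    normalized = [t.strip(string.punctuation).lower() for t in tokens]
    try:
        idx = normalized.index(word.lower())
    except ValueError:
        return ""  # word not present in the text
    start = max(0, idx - n)  # avoid a negative start wrapping around
    # Join the original tokens so punctuation and case are preserved.
    return " ".join(tokens[start:idx + n + 1])


raw_text = "Duis arcu ante, Mollis, sed iaculis non, hendrerit ut odio."
print(snippet_around(raw_text, "mollis"))
```

The snippet is built from the original tokens, so the normalization only affects matching, not the returned text.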

Answer 2

Here is a solution using list slicing:

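def get_context_around(text, word, accuracy):
    words = text.split()
    first_hit = words.index(word)  # raises ValueError if word is absent
    # Clamp the start: a negative index would wrap around to the end of the list
    start = max(0, first_hit - accuracy)

    return ' '.join(words[start:first_hit + accuracy + 1])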


raw_text= "This is an example lorem ipsum sentence for a Stackoverflow question."
my_word = "sentence"
print(get_context_around(raw_text, my_word, accuracy=3)) # example lorem ipsum sentence for a Stackoverflow
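
Splitting and re-joining collapses the original whitespace (double spaces, tabs, newlines). If the snippet should be an exact slice of the source text, one possible sketch is to locate the words with `re.finditer` and slice by character offsets. The name `context_span` is hypothetical, and the punctuation-stripping comparison is an assumption about what counts as a match.

```python
import re
import string


def context_span(text, word, n=3):
    """Slice text from n words before to n words after the first match."""
    # Each match is one whitespace-delimited token with its character offsets.
    tokens = list(re.finditer(r"\S+", text))
    for i, m in enumerate(tokens):
        if m.group().strip(string.punctuation).lower() == word.lower():
            first = tokens[max(0, i - n)]
            last = tokens[min(len(tokens) - 1, i + n)]
            # Slice the original string so spacing is kept exactly as written.
            return text[first.start():last.end()]
    return None  # word not found


raw_text = "This is an example lorem ipsum sentence for a Stackoverflow question."
print(context_span(raw_text, "sentence"))
```

Returning `None` when the word is absent is a design choice for this sketch; raising an exception, as `list.index` does, would work equally well.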
