Question

I am reading the GeeksforGeeks documentation. There is a problem there titled "Sentence that contains all the given phrases".

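The details are as follows: given a list of sentences and a list of phrases, the task is to find which sentences contain all the words of a phrase, and for each phrase to print the numbers of the sentences that contain it.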

Example input:

sent = ["Strings are an array of characters", 
    "Sentences are an array of words"] 
ph = ["an array of", "sentences are strings"]

Output:

Phrase1:
1 2
Phrase2:
NONE

Code:

# Python program to find the sentence 
# that contains all the given phrases  
def getRes(sent, ph): 
    sentHash = dict() 

    # Loop for adding hashed sentences to sentHash 
    for s in range(1, len(sent)+1): 
        sentHash[s] = set(sent[s-1].split()) 

    # For Each Phrase 
    for p in range(0, len(ph)): 
        print("Phrase"+str(p + 1)+":") 

        # Get the list of Words 
        wordList = ph[p].split() 
        res = [] 

        # Then Check in every Sentence 
        for s in range(1, len(sentHash)+1): 
            wCount = len(wordList) 

            # Every word in the Phrase 
            for w in wordList: 
                if w in sentHash[s]: 
                    wCount -= 1

            # If every word in phrase matches 
            if wCount == 0: 

                # Add the sentence index to the result list 
                res.append(s) 
        if(len(res) == 0): 
            print("NONE") 
        else: 
            print('% s' % ' '.join(map(str, res))) 

# Driver Function 
def main(): 
    sent = ["Strings are an array of characters", 
    "Sentences are an array of words"] 
    ph = ["an array of", "sentences are strings"] 
    getRes(sent, ph) 

main()

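This works correctly, but I would like to know how to optimize it to reduce the time complexity or make it run faster. I am also working on similar problems, which is why I am asking. I would really appreciate your help.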

Answer 1

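Your current algorithm runs in roughly O(|sent| * |phrase| * k), where k is the average number of words in a sentence. Patrik's answer reduces k to the average number of words in a phrase, which in your case should be under 10, so that is a big improvement.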

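The worst case probably cannot be improved, but we can still improve the average case. The idea is to build an index whose keys are all the words that appear in the sentences and whose values are lists of the numbers of the sentences containing that word.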

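That way, for a given phrase, we can check how many sentences each of its words appears in and traverse only the shortest of those lists. For example, if a phrase contains a word that appears in no sentence, we avoid iterating over the sentences for that phrase entirely.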

from collections import Counter
from collections import defaultdict

def containsQty(sentence, phrase):
    qty = 100000
    for word in phrase:
        qty = min(qty, sentence[word] // phrase[word])
        if qty == 0:
            break
    return qty

sent = ["bob and alice like to text each other", "bob does not like to ski but does not like to fall", "alice likes to ski"] 
ph = ["bob alice", "alice", "like"]

sent = [Counter(word.lower() for word in sentence.split()) for sentence in sent]
ph   = [Counter(word.lower() for word in sentence.split()) for sentence in ph]

indexByWords = defaultdict(list)

for index, counter in enumerate(sent, start = 1):
    for word in counter.keys():
        indexByWords[word].append(index)


for i, phrase in enumerate(ph, start=1):
    print("Phrase{}:".format(i))

    best = None
    minQty = len(sent) + 1
    for word in phrase.keys():
        if minQty > len(indexByWords[word]):
            minQty = len(indexByWords[word])
            best = indexByWords[word]

    matched = False
    for index in best:
        qty = containsQty(sent[index - 1], phrase)
        if qty > 0:
            matched = True
            print((str(index) + ' ') * qty)
    if not matched:
        print("NONE")
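
If repeat counts do not matter (the set-based matching used in the question), the same kind of index can answer a phrase by intersecting the sentence-number sets of its words, starting from the rarest word. The following is only a minimal sketch of that variant, not the code above: the names build_index and sentences_with_all_words are made up for illustration, and an empty phrase is treated as matching nothing, which is a choice made here.

```python
from collections import defaultdict


def build_index(sentences):
    """Map each lowercased word to the set of 1-based sentence numbers containing it."""
    index = defaultdict(set)
    for number, sentence in enumerate(sentences, start=1):
        for word in sentence.lower().split():
            index[word].add(number)
    return index


def sentences_with_all_words(index, phrase):
    """Return the sorted sentence numbers that contain every word of the phrase."""
    words = set(phrase.lower().split())
    if not words:
        return []  # assumption: an empty phrase matches nothing
    # Use .get so that looking up an unknown word does not add a key to the index.
    postings = sorted((index.get(word, set()) for word in words), key=len)
    result = set(postings[0])  # start from the rarest word
    for posting in postings[1:]:
        if not result:
            break  # some word already ruled out every sentence
        result &= posting
    return sorted(result)


sent = ["bob and alice like to text each other",
        "bob does not like to ski but does not like to fall",
        "alice likes to ski"]
ph = ["bob alice", "alice", "like"]

index = build_index(sent)
for i, phrase in enumerate(ph, start=1):
    print("Phrase{}:".format(i))
    matches = sentences_with_all_words(index, phrase)
    if matches:
        print(*matches)
    else:
        print("NONE")
```

Starting from the smallest set keeps the intermediate result small, and a word that appears in no sentence makes the result empty immediately.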

Answer 2

By using the Counter class from the collections module, your logic can be made much simpler:

from collections import Counter

def contains(sentence, phrase):
    return all(sentence[word] >= phrase[word] for word in phrase)

sent = ["Strings are an array of characters", 
        "Sentences are an array of words"] 
ph = ["an array of", "sentences are strings"]

sent = [Counter(word.lower() for word in sentence.split()) for sentence in sent]
ph   = [Counter(word.lower() for word in sentence.split()) for sentence in ph]

for i, phrase in enumerate(ph, start=1):
    print("Phrase{}:".format(i))
    matches = [j for j, sentence in enumerate(sent, start=1) if contains(sentence, phrase)]
    if not matches:
        print("NONE")
    else:
        print(*matches)

This lets us count the words of each sentence only once, rather than once per phrase.
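
Because the words are counted rather than just collected into a set, a phrase that repeats a word only matches a sentence that repeats it at least as often. As a small sketch of that behaviour, the hypothetical helpers contains_all and count_words below (names invented for illustration) express the same check through Counter subtraction, which keeps only positive counts; the sample sentence is borrowed purely as test data.

```python
from collections import Counter


def count_words(text):
    """Count the lowercased, whitespace-separated words of a text."""
    return Counter(text.lower().split())


def contains_all(sentence_counts, phrase_counts):
    # Counter subtraction drops zero and negative counts, so an empty result
    # means the sentence has at least as many of every phrase word.
    return not (phrase_counts - sentence_counts)


sentence = count_words("bob does not like to ski but does not like to fall")

print(contains_all(sentence, count_words("like like")))       # True
print(contains_all(sentence, count_words("like like like")))  # False
print(contains_all(sentence, count_words("bob alice")))       # False
```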

Answer 3

I am trying to do it in O(n^2) with the following code:

import time
millis = int(round(time.time() * 1000))


sent = ["Strings are an array of characters",
        "Sentences are an array of words"]
ph = ["an array of","sentences are strings"]

s2 = [c.split() for c in ph]
s1=[d.split() for d in sent]
print(s2)
print(s1)

for i in s2:
    z=[]
    phcount=set(i)
    x = len(phcount)  # count distinct words, so a repeated word in the phrase cannot block a match
    for idx1,j in enumerate(s1):
        sentcount=set(j)
        y = phcount.intersection(sentcount)
        if len(y)==x:
            z.append(idx1)
    if len(z)>0:
        print(z)
    else:
        print("NONE") 
millis2 = int(round(time.time() * 1000))          
print (millis2-millis)
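
A sketch of the same set-based idea with a few adjustments, each of which is an assumption about what is wanted rather than part of the code above: the sentence sets are built once instead of once per phrase, the subset operator replaces the length comparison, sentence numbers are 1-based as in the question's expected output, and time.perf_counter is used for the timing. The function name match_phrases is invented for illustration.

```python
import time


def match_phrases(sentences, phrases):
    """For each phrase, list the 1-based numbers of sentences containing all its words."""
    # Build each sentence's word set once, outside the phrase loop.
    sentence_sets = [set(s.split()) for s in sentences]
    results = []
    for phrase in phrases:
        phrase_words = set(phrase.split())
        results.append([number
                        for number, words in enumerate(sentence_sets, start=1)
                        if phrase_words <= words])  # subset test
    return results


sent = ["Strings are an array of characters",
        "Sentences are an array of words"]
ph = ["an array of", "sentences are strings"]

start = time.perf_counter()
results = match_phrases(sent, ph)
elapsed_ms = (time.perf_counter() - start) * 1000

for i, matches in enumerate(results, start=1):
    print("Phrase{}:".format(i))
    if matches:
        print(*matches)
    else:
        print("NONE")
print("elapsed: {:.3f} ms".format(elapsed_ms))
```

Like the code above, this comparison is case-sensitive, so "sentences" does not match "Sentences".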

Optimizing "Sentence that contains all the given phrases"