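Print all possible phrases (consecutive combinations of words) in a given string

Time: 2014-07-25 21:35:31

Tags: python

I'm trying to print the phrases in a given text. I want to be able to print every phrase in the text, from 2 words up to the maximum number of words the text allows. I've written a program below that prints all phrases up to 5 words long, but I can't find a more elegant way to print all possible phrases.

My definition of a phrase = consecutive words in the string, regardless of meaning.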

def phrase_builder(i):
    phrase_length = 4
    phrase_list = []
    for x in range(0, len(i)-phrase_length):
        phrase_list.append(str(i[x]) + " " + str(i[x+1]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]) + " " + str(i[x+4]))
    return phrase_list

text = "the big fat cat sits on the mat eating a rat"

print(phrase_builder(text.split()))

The output of this is:

['the big', 'the big fat', 'the big fat cat', 'the big fat cat sits',
'big fat', 'big fat cat', 'big fat cat sits', 'big fat cat sits on',
'fat cat', 'fat cat sits', 'fat cat sits on', 'fat cat sits on the',
'cat sits', 'cat sits on', 'cat sits on the', 'cat sits on the mat',
'sits on', 'sits on the', 'sits on the mat', 'sits on the mat eating',
'on the', 'on the mat', 'on the mat eating', 'on the mat eating a',
'the mat', 'the mat eating', 'the mat eating a', 'the mat eating a rat']

I want to be able to print phrases such as "the big fat cat sits on the mat eating", "fat cat sits on the mat eating a rat", and so on.

Can anyone offer some advice?

4 answers:

Answer 0 (score: 15)

Just use itertools.combinations:

from itertools import combinations
text = "the big fat cat sits on the mat eating a rat"
lst = text.split()
for start, end in combinations(range(len(lst)), 2):
    print(lst[start:end+1])

Output:

['the', 'big']
['the', 'big', 'fat']
['the', 'big', 'fat', 'cat']
['the', 'big', 'fat', 'cat', 'sits']
['the', 'big', 'fat', 'cat', 'sits', 'on']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['big', 'fat']
['big', 'fat', 'cat']
['big', 'fat', 'cat', 'sits']
['big', 'fat', 'cat', 'sits', 'on']
['big', 'fat', 'cat', 'sits', 'on', 'the']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['fat', 'cat']
['fat', 'cat', 'sits']
['fat', 'cat', 'sits', 'on']
['fat', 'cat', 'sits', 'on', 'the']
['fat', 'cat', 'sits', 'on', 'the', 'mat']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['cat', 'sits']
['cat', 'sits', 'on']
['cat', 'sits', 'on', 'the']
['cat', 'sits', 'on', 'the', 'mat']
['cat', 'sits', 'on', 'the', 'mat', 'eating']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['sits', 'on']
['sits', 'on', 'the']
['sits', 'on', 'the', 'mat']
['sits', 'on', 'the', 'mat', 'eating']
['sits', 'on', 'the', 'mat', 'eating', 'a']
['sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['on', 'the']
['on', 'the', 'mat']
['on', 'the', 'mat', 'eating']
['on', 'the', 'mat', 'eating', 'a']
['on', 'the', 'mat', 'eating', 'a', 'rat']
['the', 'mat']
['the', 'mat', 'eating']
['the', 'mat', 'eating', 'a']
['the', 'mat', 'eating', 'a', 'rat']
['mat', 'eating']
['mat', 'eating', 'a']
['mat', 'eating', 'a', 'rat']
['eating', 'a']
['eating', 'a', 'rat']
['a', 'rat']
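
The loop above prints lists of words. If you want strings like the ones in the question, one option is to join each slice. A minimal sketch in Python 3 syntax; the helper name all_phrases is only an illustration and not part of the original answer:

```python
from itertools import combinations

def all_phrases(words):
    """Return every run of 2 or more consecutive words as a space-joined string."""
    # combinations(range(n), 2) yields every index pair (start, end) with start < end
    return [" ".join(words[start:end + 1])
            for start, end in combinations(range(len(words)), 2)]

phrases = all_phrases("the big fat cat".split())
print(phrases)
```

For the 11-word example text this gives 55 phrases, one per line of the output above.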

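Answer 1 (score: 2)

First, you need to figure out how to write all four of those lines the same way. Instead of concatenating words and spaces by hand, use the join method: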

phrase_list.append(" ".join(str(i[x+y]) for y in range(2)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(3)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(4)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(5)))

If the generator expression inside the join call is unclear, write it out by hand like this:

phrase = []
for y in range(2):
    phrase.append(str(i[x+y]))
phrase_list.append(" ".join(phrase))

Once that's done, replacing those four lines with a loop is trivial:

for length in range(2, phrase_length + 2):  # lengths 2..5 when phrase_length is 4
    phrase_list.append(" ".join(str(i[x+y]) for y in range(length)))

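You can simplify this further in several independent ways.

First, i[x+y] for y in range(length) is more easily done with a slice: i[x:x+length].

I'm guessing i is already a list of strings, so you can get rid of the str calls.

Also, range starts at 0 by default, so you can leave that argument off.

While we're at it, code is easier to reason about if you use meaningful variable names (such as words instead of i).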
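
To see that the slice really is a drop-in replacement for the indexed generator expression, here is a small check. It is only a sketch in Python 3 syntax; the sample words and the values of x and length are arbitrary and chosen for illustration:

```python
words = "the big fat cat sits".split()
x, length = 1, 3  # arbitrary start index and phrase length

# Index-by-index version, as in the original code
by_index = " ".join(str(words[x + y]) for y in range(length))

# Slice version: same words, no str() calls, no explicit index arithmetic
by_slice = " ".join(words[x:x + length])

print(by_index == by_slice)
```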

So, for phrases of exactly phrase_length words:

def phrase_builder(words):
    phrase_length = 4
    phrase_list = []
    for i in range(len(words) - phrase_length):
        phrase_list.append(" ".join(words[i:i+phrase_length]))
    return phrase_list

Now that your loop is this simple, you can turn it into a comprehension, and the whole thing becomes a one-liner:

def phrase_builder(words):
    phrase_length = 4
    return [" ".join(words[i:i+phrase_length]) 
            for i in range(len(words) - phrase_length)]

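One last thing: as @SoundDefense asked, are you sure you don't want "eating a rat"? It starts fewer than 5 words from the end, but it is a 3-word phrase in the text.

If you do want it, just remove the - phrase_length part.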
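
If what you actually want is every length from 2 up to some maximum, including the shorter phrases near the end of the text, one way to combine the length loop with slicing is sketched below. This rests on an assumption about the intended result rather than on code from this answer; the name bounded_phrases and the max_length parameter are hypothetical.

```python
def bounded_phrases(words, max_length=5):
    """Phrases of 2..max_length consecutive words, including those near the end."""
    phrases = []
    # Every start index that still leaves room for a 2-word phrase
    for i in range(len(words) - 1):
        # Never ask for more words than remain after position i
        longest = min(max_length, len(words) - i)
        for length in range(2, longest + 1):
            phrases.append(" ".join(words[i:i + length]))
    return phrases

text = "the big fat cat sits on the mat eating a rat"
phrases = bounded_phrases(text.split())
print(len(phrases))
print(phrases[-3:])
```

Passing max_length=len(words) should give every possible phrase of two or more words.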

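Answer 2 (score: 1)

You need a systematic way to enumerate every possible phrase.

One way is to start at each word and generate every possible phrase that begins with that word.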

def phrase_builder(my_words):
    phrases = []
    for i, word in enumerate(my_words):
        phrases.append(word)
        for nextword in my_words[i+1:]:
            phrases.append(phrases[-1] + " " + nextword)
        # Remove the one-word phrase.
        phrases.remove(word)
    return phrases



text = "the big fat cat sits on the mat eating a rat"

print(phrase_builder(text.split()))
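
A possible variant of the same idea keeps the growing phrase in a local variable, so the single word never has to be added to the list and removed again. This is a sketch in Python 3 syntax, not the code from this answer; the names prefix_phrases and current are made up for illustration.

```python
def prefix_phrases(my_words):
    """For each start word, extend the phrase one word at a time."""
    phrases = []
    for i, word in enumerate(my_words):
        current = word  # running phrase; the bare word itself is never stored
        for next_word in my_words[i + 1:]:
            current = current + " " + next_word
            phrases.append(current)
    return phrases

result = prefix_phrases("a b c".split())
print(result)
```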

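Answer 3 (score: 1)

I think the simplest approach is to iterate over all possible start and end positions in the words list and generate a phrase for each corresponding sublist of words: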

def phrase_builder(words):
    for start in range(0, len(words)-1):
        for end in range(start+2, len(words)+1):
            yield ' '.join(words[start:end])

text = "the big fat cat sits on the mat eating a rat"
for phrase in phrase_builder(text.split()):
    print(phrase)

Output:

the big
the big fat
...
the big fat cat sits on the mat eating a rat
...
sits on the mat eating a
...
eating a rat
a rat
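
A quick sanity check on the count: a text of n words has n*(n-1)/2 phrases of two or more consecutive words, so the 11-word example should give 55. The sketch below repeats the generator in Python 3 syntax so that it runs on its own; the variable names outside the function are only for illustration.

```python
def phrase_builder(words):
    # start is the index of the first word, end is one past the last word
    for start in range(0, len(words) - 1):
        for end in range(start + 2, len(words) + 1):
            yield " ".join(words[start:end])

text = "the big fat cat sits on the mat eating a rat"
words = text.split()
phrases = list(phrase_builder(words))

n = len(words)
# Both numbers should match if every phrase is produced exactly once
print(len(phrases), n * (n - 1) // 2)
```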