Question

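I have a list populated with words from a dictionary. I want to find a way to remove every word that contains a root word, considering only root words that appear at the beginning of the target word.

For example, the word "rodeo" would be removed from the list because it contains the valid English word "rode". "Typewriter" would be removed because it contains the valid English word "type". However, the word "snicker" is still valid even though it contains the word "nick", because "nick" is in the middle of the word rather than at the beginning.

I was thinking of something like this: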

 for line in wordlist:
        if line.find(...) --

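But I want the "if" statement to then go through every word in the list, check whether it is found at the start, and if so remove the longer word from the list, so that only the root words remain. Do I have to create a copy of wordlist in order to iterate over it?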

Answer 1

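So, you have two lists: the list of words you want to check and possibly remove, and a list of valid words. If you like, you can use the same list for both purposes, but I'll assume you have two lists.

For speed, you should turn your list of valid words into a set. Then you can very quickly check whether any particular word is in that set. Then take each word and check whether any of its prefixes exists in the valid set. Since "a" and "I" are valid words in English, will you remove all valid words starting with "a", or will you have a rule that sets a minimum length for the prefix?

I'm using the file /usr/share/dict/words from my Ubuntu install. This file has all sorts of odd things in it; for example, it seems to contain every letter by itself as a word. Thus "k" is in there, "q", "z", and so on. As far as I know none of these are words, but they are probably in there for some technical reason. Anyway, I decided to simply exclude anything shorter than three letters from my valid words list.

Here is what I came up with: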

# build valid set from /usr/share/dict/words
wfile = "/usr/share/dict/words"
with open(wfile) as f:
    # strip the newline first so the length test counts letters only
    valid = set(w for w in (line.strip() for line in f) if len(w) >= 3)

lst = ["ark", "booze", "kite", "live", "rodeo"]

def subwords(word):
    for i in range(len(word) - 1, 0, -1):
        w = word[:i]
        yield w

newlst = []
for word in lst:
    # uncomment these for debugging to make sure it works
    # print("subwords", [w for w in subwords(word)])
    # print("valid subwords", [w for w in subwords(word) if w in valid])
    if not any(w in valid for w in subwords(word)):
        newlst.append(word)

print(newlst)

If you are a fan of one-liners, you could do away with the for loop and use a list comprehension:

newlst = [word for word in lst if not any(w in valid for w in subwords(word))]

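I think that's more terse than it should be, and I like being able to put in the print statements for debugging.

Hmm, come to think of it, it isn't too terse if you just add another function: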

def keep(word):
    return not any(w in valid for w in subwords(word))

newlst = [word for word in lst if keep(word)]

Python can be easy to read and understand if you make functions like this and give them good names.
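For trying the idea without the dictionary file, here is a minimal sketch that also makes the minimum prefix length explicit. The names `proper_prefixes` and `min_len` and the tiny `valid` set are assumptions for illustration, not part of the code above.

```python
def proper_prefixes(word, min_len=1):
    """Yield each proper prefix of word that has at least min_len letters."""
    for i in range(len(word) - 1, min_len - 1, -1):
        yield word[:i]

def keep(word, valid, min_len=1):
    """Return True if no proper prefix of word is in the valid set."""
    return not any(p in valid for p in proper_prefixes(word, min_len))

# Hypothetical stand-in for the set built from the dictionary file.
valid = {"a", "boo", "kit", "rode", "type"}
lst = ["ark", "booze", "kite", "live", "rodeo"]

no_minimum = [w for w in lst if keep(w, valid)]
three_plus = [w for w in lst if keep(w, valid, min_len=3)]
print(no_minimum)   # "ark" is dropped because of the prefix "a"
print(three_plus)   # "ark" survives once one-letter prefixes are ignored
```

Passing the set and the minimum length as arguments keeps the rule about short prefixes in one visible place instead of hiding it in how the set was built.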

Answer 2

I'm assuming that you only have one list, from which you want to remove any element that has a prefix in the same list.

# Important assumption here: wordlist is sorted

base = wordlist[0]                    # consider the first word in the list
for word in wordlist:                 # loop through the entire list checking if
    if not word.startswith(base):     # the word we're considering starts with the base
        print(base)                   # if not, we have a new base: print the current
        base = word                   # one and move to this new one
    # else word starts with base:
    #     don't output word, and go on to the next item in the list
print(base)                           # finish by printing the last base

EDIT: Added some comments to make the logic more obvious.

Answer 3

I find jkerian's answer to be the best (assuming only one list), and I would like to explain why.

Here is my version of the code (as a function):

wordlist = ["a", "arc", "arcane", "apple", "car", "carpenter", "cat", "zebra"]

def root_words(wordlist):
    result = []
    base = wordlist[0]
    for word in wordlist:
        if not word.startswith(base):
            result.append(base)
            base = word
    result.append(base)
    return result

print(root_words(wordlist))

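As long as the word list is sorted (you could do this in the function if you wanted to), this will get the result in a single pass. This is because when you sort the list, all words that start with another word in the list come directly after that root word. For example, anything that falls between "arc" and "arcane" in your particular list will also be eliminated because of the root word "arc".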
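As a sketch of the "sort inside the function" variant mentioned above, the version below sorts its input first and also tolerates an empty list. The name `root_words_sorted` is an assumption chosen so that it does not clash with the function in this answer.

```python
def root_words_sorted(wordlist):
    """Return the words that do not start with another word from the list."""
    result = []
    base = None
    for word in sorted(wordlist):
        # After sorting, every word that extends `base` directly follows it.
        if base is None or not word.startswith(base):
            result.append(word)
            base = word
    return result

# Deliberately shuffled input to show that the order no longer matters.
shuffled = ["zebra", "carpenter", "arc", "cat", "apple", "arcane", "car", "a"]
roots = root_words_sorted(shuffled)
print(roots)
```

The cost is dominated by the sort; the loop itself still touches each word once.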

Answer 4

You should use the built-in lambda functionality. I think it will make your life easier.

words = ['rode', 'nick'] # this is the list of all the words that you have.
                         # I'm using 'rode' and 'nick' as they're in your example
listOfWordsToTry = ['rodeo', 'snicker']

def validate(w):
    # reject w as soon as any known word is found at its start
    for word in words:
        if w.startswith(word):
            return False
    return True

# in Python 3, filter() returns an iterator, so wrap it in list()
wordsThatDontStartWithValidEnglishWords = \
    list(filter(lambda x: validate(x), listOfWordsToTry))

This should work for your purposes, unless I have misunderstood your question.

Hope this helps.

Answer 5

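I wrote an answer that assumes two lists: the list to be pruned and the list of valid words. In the discussion around my answer, I commented that maybe a trie solution would be good.

What the heck, I went ahead and wrote it.

You can read about a trie here: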

http://en.wikipedia.org/wiki/Trie

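For my Python solution, I basically used dictionaries. A key is a sequence of symbols, and each symbol goes into a dict, with another Trie instance as the data. A second dictionary stores "terminal" symbols, which mark the end of a "word" in the trie. For this example, the "words" are actually words, but in principle the words could be any sequence of hashable Python objects.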

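The Wikipedia example shows a trie where the keys are letters, but they can be more than a single letter; they can be sequences of multiple letters. For simplicity, my code uses only a single symbol at a time as a key.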

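If you add both the word "cat" and the word "catch" to the trie, then there will be nodes for 'c', 'a', and 't' (and also for the second 'c' in "catch"). At the node level for 'a', the dictionary of "terminals" will have 't' in it (thus completing the coding for "cat"), and likewise at the deeper node level of the second 'c' the dictionary of terminals will have 'h' in it (completing "catch"). So adding "catch" after "cat" only means adding the nodes for 't' and the second 'c' plus one more entry in a terminals dictionary. The trie structure is a very efficient way to store and index a really large list of words.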
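To make that layout concrete before the full class, here is a minimal sketch of the "cat" and "catch" case using plain nested dicts. The helpers `new_node` and `add_word` and the keys "sub" and "term" are hypothetical names for this illustration only; they are not the Trie class used in this answer.

```python
def new_node():
    # Hypothetical plain-dict node: "sub" maps a symbol to a child node,
    # "term" holds the symbols that end a word at this level.
    return {"sub": {}, "term": set()}

def add_word(root, word):
    """Add word to the structure rooted at root (empty words are ignored)."""
    if not word:
        return
    node = root
    for ch in word[:-1]:
        # Walk down, creating a child node wherever one is missing.
        node = node["sub"].setdefault(ch, new_node())
    node["term"].add(word[-1])

root = new_node()
add_word(root, "cat")
add_word(root, "catch")

a_node = root["sub"]["c"]["sub"]["a"]
second_c = a_node["sub"]["t"]["sub"]["c"]
print(a_node["term"])     # terminal symbols recorded at the 'a' level
print(second_c["term"])   # terminal symbols recorded at the second 'c' level
```

The shared path c, a is stored only once, which is where the space saving for large word lists comes from.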

def _pad(n):
    return " " * n

class Trie(object):
    def __init__(self):
        self.t = {}  # dict mapping symbols to sub-tries
        self.w = {}  # dict listing terminal symbols at this level

    def add(self, word):
        if 0 == len(word):
            return
        cur = self
        for ch in word[:-1]: # add all symbols but terminal
            if ch not in cur.t:
                cur.t[ch] = Trie()
            cur = cur.t[ch]
        ch = word[-1]
        cur.w[ch] = True  # add terminal

    def prefix_match(self, word):
        if 0 == len(word):
            return False
        cur = self
        for ch in word[:-1]: # check all symbols but last one
            # If you check the last one, you are not checking a prefix,
            # you are checking whether the whole word is in the trie.
            if ch in cur.w:
                return True
            if ch not in cur.t:
                return False
            cur = cur.t[ch]  # walk down the trie to next level
        return False

    def debug_str(self, nest, s=None):
        "print trie in a convenient nested format"
        lst = []
        s_term = "".join(ch for ch in self.w)
        if 0 == nest:
            lst.append(object.__str__(self))
            lst.append("--top--: " + s_term)
        else:
            tup = (_pad(nest), s, s_term)
            lst.append("%s%s: %s" % tup)
        for ch, d in self.t.items():
            lst.append(d.debug_str(nest+1, ch))
        return "\n".join(lst)

    def __str__(self):
        return self.debug_str(0)



t = Trie()


# Build the trie from /usr/share/dict/words, which has every letter of
# the alphabet as words!  Only take 2-letter words and longer.

wfile = "/usr/share/dict/words"
with open(wfile) as f:
    for line in f:
        word = line.strip()
        if len(word) >= 2:
            t.add(word)

# add valid 1-letter English words
t.add("a")
t.add("I")



lst = ["ark", "booze", "kite", "live", "rodeo"]
# "ark" starts with "a"
# "booze" starts with "boo"
# "kite" starts with "kit"
# "live" is good: "l", "li", "liv" are not words
# "rodeo" starts with "rode"

newlst = [w for w in lst if not t.prefix_match(w)]

print(newlst)  # prints: ['live']

Answer 6

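I didn't want to provide an exact solution, but I think there are two key functions in Python that will help you greatly here.

The first, which jkerian mentioned: str.startswith() http://docs.python.org/library/stdtypes.html#str.startswith

The second: filter() http://docs.python.org/library/functions.html#filter

With filter, you can write a conditional function that checks whether a word starts with another word in the list and, if so, rejects it.

For each word in the list, you will need to iterate over all of the other words and evaluate that condition; filter can then return the proper subset of root words.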
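One way the two functions might fit together is sketched below. This is only an illustration under the single-list assumption; the names `root_words` and `is_root` are made up here, and the nested scan is quadratic in the length of the list.

```python
def root_words(words):
    """Keep only the words that do not start with a different word in the list."""
    def is_root(word):
        # A word is rejected when some other word is found at its start.
        return not any(word != other and word.startswith(other)
                       for other in words)
    # filter() returns an iterator in Python 3, so materialize it as a list.
    return list(filter(is_root, words))

sample = ["rode", "rodeo", "type", "typewriter", "snicker", "nick"]
roots = root_words(sample)
print(roots)
```

For a large dictionary the sorted single-pass approach from the other answers would likely be the faster choice.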

Answer 7

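I only have one list, and I want to remove from it any word that is a prefix of another word.

Here is a solution that should run in O(N log N) time and O(M) space, where M is the size of the returned list. The runtime is dominated by the sort.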

l = sorted(your_list)
removed_prefixes = [l[g] for g in range(0, len(l)-1) if not l[g+1].startswith(l[g])] + l[-1:]

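If the list is sorted, then the item at index N is a prefix if the item at index N+1 starts with it.
At the end, it appends the last item of the sorted list, since by definition it cannot be a prefix. Handling it separately also lets us iterate over the indices without ever going out of range.

If you have the banned list hardcoded in another list: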

banned = tuple(banned_prefixes)
removed_prefixes = [i for i in your_list if not i.startswith(banned)]

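This relies on the fact that startswith accepts a tuple. It probably runs in something close to N * M, where N is the number of elements in the list and M is the number of elements in banned. Python could conceivably be doing some clever things to make it a bit quicker. If, like the OP, you want to ignore case, you will need .lower() calls in places.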
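A minimal sketch of the case-insensitive variant follows. The sample data is hypothetical; lowercasing both the banned prefixes and each word before the test is one reasonable reading of "ignore case".

```python
# Hypothetical sample data for illustration.
your_list = ["Rodeo", "snicker", "Typewriter", "live"]
banned_prefixes = ["RODE", "type"]

# str.startswith accepts a tuple of prefixes and checks each of them.
banned = tuple(p.lower() for p in banned_prefixes)

# Compare in lowercase, but keep the original spelling in the result.
kept = [w for w in your_list if not w.lower().startswith(banned)]
print(kept)
```

Note that a word equal to one of the banned prefixes is removed as well, because every string starts with itself.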
