Finding the maximum lengths of all n-word-long substrings shared by two strings

Date: 2013-11-28 13:32:36

Tags: python string algorithm pattern-matching substring

I'm working on a Python script that finds the lengths of all maximally long n-word-long substrings shared by two strings, disregarding trailing punctuation. Given the two strings:

"this is a sample string"

"this is also a sample string"

I want the script to identify that these strings have a 2-word sequence in common ("this is"), followed by a 3-word sequence in common ("a sample string"). Here is my current approach:

a = "this is a sample string"
b = "this is also a sample string"

aWords = a.split()
bWords = b.split()

#create counters to keep track of position in string
currentA = 0
currentB = 0

#create counter to keep track of longest sequence of matching words
matchStreak = 0

#create a list that contains all of the matchstreaks found
matchStreakList = []

#create binary switch to control the use of while loop
continueWhileLoop = 1

for word in aWords:
    currentA += 1

    if word == bWords[currentB]:
        matchStreak += 1

        #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
        if currentB + 1 < len(bWords):
            currentB += 1

        #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
        if currentA == len(aWords):
            matchStreakList.append(matchStreak)

    elif word != bWords[currentB]:

        #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to out list of streaks and then reset the counter
        if matchStreak >= 1:
            matchStreakList.append(matchStreak)
        matchStreak = 0

        while word != bWords[currentB]:

            #the two words don't match. If you can move b forward one word, do so, then check for another match
            if currentB + 1 < len(bWords):
                currentB += 1

            #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
            elif currentB + 1 == len(bWords):
                currentB = 0
                break

        if word == bWords[currentB]:
            matchStreak += 1

            #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
            if currentB + 1 < len(bWords):
                currentB += 1
            elif currentB + 1 == len(bWords):

                #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
                if currentA == len(aWords):
                    matchStreakList.append(matchStreak)
                currentB = 0
                break

print matchStreakList

This script correctly outputs the (maximal) lengths of the common n-word-long substrings (2, 3), and has done so for every test so far. My question is: is there a pair of strings for which the approach above won't work? More to the point: are there existing Python libraries or well-known approaches for finding the maximum lengths of all n-word-long substrings shared by two strings?

[This question is distinct from the longest common substring problem, which is only a special case of what I'm looking for (since I want to find all common substrings, not just the longest). This SO post suggests that methods such as 1) cluster analysis, 2) edit distance routines, and 3) longest common sequence algorithms might be suitable approaches, but I haven't found any working solutions, and my problem is perhaps slightly easier than the one discussed in that link because I'm dealing with words delimited by whitespace.]

EDIT

I've started a bounty on this question. In case it helps anyone, I want to clarify a few quick points. First, the helpful answer below suggested by @DhruvPathak does not find all of the maximally long n-word-long substrings shared by two strings. For example, suppose the two strings we're analyzing are:

"They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"

"You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

In this case, the list of maximally long n-word-long substrings (disregarding trailing punctuation) is:

all
are
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every

Using the following routine:

#import required packages
import difflib

#define function we'll use to identify matches
def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()

print matches(a,b)

yields the output:

['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']

In the first place, I'm not sure how to select from this list only the substrings made up of whole words. In the second place, the list does not include "are", which is one of the desired maximally long common n-word-long substrings. Is there a method for finding all of the longest n-word-long substrings shared by these two strings ("You are all..." and "They all are...")?

4 Answers:

Answer 0 (score: 5)

There are still ambiguities here, and I don't want to spend time arguing about them. But I think I can add something useful regardless ;-)

I wrote Python's difflib.SequenceMatcher, and spent a lot of time finding expected-case fast ways to find the longest common substrings. In theory, this should be done with "suffix trees", or the related "suffix arrays" augmented with "longest common prefix arrays" (the phrases in quotes are search terms if you want to Google for more). Those can solve the problem in worst-case linear time. But, as is sometimes the case, the worst-case linear-time algorithms are excruciatingly complex and delicate, and suffer from large constant factors - they can still pay off hugely if a given corpus is going to be searched many times, but that's not the typical case for Python's difflib, and it doesn't look like your case either.
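
As a very rough illustration of the suffix-array idea (a naive, quadratic-ish sketch of my own with a made-up name, nothing like a worst-case linear-time implementation, and it only finds a single longest common word run rather than all local maximums): sort every word-level suffix of both texts together, and a longest common run then shows up as the common prefix of two neighbouring suffixes that come from different texts.

    def longest_common_word_run(a_words, b_words):
        # Naive "suffix array": every word-level suffix of both texts,
        # tagged with the text it came from, sorted lexicographically.
        suffixes = [(a_words[i:], 'a') for i in range(len(a_words))]
        suffixes += [(b_words[j:], 'b') for j in range(len(b_words))]
        suffixes.sort()
        best = []
        # A longest common run appears as the longest common prefix of two
        # adjacent suffixes coming from different texts.
        for (s1, src1), (s2, src2) in zip(suffixes, suffixes[1:]):
            if src1 == src2:
                continue
            k = 0
            while k < len(s1) and k < len(s2) and s1[k] == s2[k]:
                k += 1
            if k > len(best):
                best = s1[:k]
        return best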

Anyway, my contribution here is to rewrite SequenceMatcher's find_longest_match() method to return all the (locally) maximum matches it finds along the way. Notes:

1. I'm using the to_words() function Raymond Hettinger gave you, but without the conversion to lower case. Converting to lower case leads to output that isn't exactly what you said you wanted.

2. Nevertheless, as I already noted in a comment, this does output "quill", which isn't in your list of desired outputs. I have no idea why it isn't, since "quill" does appear in both inputs.

3. Here's the code:


    import re
    def to_words(text):
        'Break text into a list of words without punctuation'
        return re.findall(r"[a-zA-Z']+", text)
    
    def match(a, b):
        # Make b the longer list.
        if len(a) > len(b):
            a, b = b, a
        # Map each word of b to a list of indices it occupies.
        b2j = {}
        for j, word in enumerate(b):
            b2j.setdefault(word, []).append(j)
        j2len = {}
        nothing = []
        unique = set() # set of all results
        def local_max_at_j(j):
            # maximum match ends with b[j], with length j2len[j]
            length = j2len[j]
            unique.add(" ".join(b[j-length+1: j+1]))
        # during an iteration of the loop, j2len[j] = length of longest
        # match ending with b[j] and the previous word in a
        for word in a:
            # look at all instances of word in b
            j2lenget = j2len.get
            newj2len = {}
            for j in b2j.get(word, nothing):
                newj2len[j] = j2lenget(j-1, 0) + 1
            # which indices have not been extended?  those are
            # (local) maximums
            for j in j2len:
                if j+1 not in newj2len:
                    local_max_at_j(j)
            j2len = newj2len
        # and we may also have local maximums ending at the last word
        for j in j2len:
            local_max_at_j(j)
        return unique
    

Then:

    a = "They all are white a sheet of spotless paper " \
        "when they first are born but they are to be " \
        "scrawled upon and blotted by every goose quill"
    b = "You are all white, a sheet of lovely, spotless " \
        "paper, when you first are born; but you are to " \
        "be scrawled and blotted by every goose's quill"
    
    print match(to_words(a), to_words(b))
    

displays:

    set(['all', 'and blotted by every', 'first are born but',
         'are to be scrawled', 'are', 'spotless paper when',
         'white a sheet of', 'quill'])

EDIT - how it works

Many sequence-matching and alignment algorithms are best understood as working on a 2-dimensional matrix, with rules for computing the matrix entries and then interpreting what the entries mean.

For input sequences a and b, picture a matrix M with len(a) rows and len(b) columns. In this application we want M[i, j] to contain the length of the longest common contiguous subsequence ending with a[i] and b[j], and the computational rules are very easy:

1. M[i, j] = 0 if a[i] != b[j].
2. M[i, j] = M[i-1, j-1] + 1 if a[i] == b[j] (where an out-of-bounds matrix reference is assumed to silently return 0).

Interpretation is also very easy in this case: there is a locally maximum non-empty match ending at a[i] and b[j], of length M[i, j], if and only if M[i, j] is non-zero but M[i+1, j+1] is either 0 or out of bounds.

You can use those rules to write very simple and compact code, with two loops, that computes M correctly for this problem. The downside is that the code takes O(len(a) * len(b)) time and space in the best, average and worst cases.
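
As a tiny worked illustration of those rules (my own example, built from the question's short strings rather than anything in this answer): with a = "this is a sample string".split() and b = "this is also a sample string".split(), the only non-zero entries of M are

    M[0, 0] = 1   ('this')
    M[1, 1] = 2   ('this is')          <- local maximum: M[2, 2] is 0
    M[2, 3] = 1   ('a')
    M[3, 4] = 2   ('a sample')
    M[4, 5] = 3   ('a sample string')  <- local maximum: M[5, 6] is out of bounds

and the two local maximums, of lengths 2 and 3, are exactly the "this is" and "a sample string" matches the question asks for.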

While it may be baffling at first, the code I posted is doing exactly the above. The connection is obscured because the code is heavily optimized, in several ways, for the expected cases:

• Instead of doing one pass to compute M and then another pass to interpret the results, computation and interpretation are interleaved in a single pass over a.

• Because of that, the whole matrix never needs to be stored. Instead only the current row (newj2len) and the previous row (j2len) exist at the same time.

• Because the matrix in this problem is usually mostly zeroes, a row is represented sparsely here, as a dict mapping column indices to non-zero values. Zero entries are "free", in that they are never stored explicitly.

• When processing a row, there is no need to iterate over every column: the precomputed b2j dict tells us exactly the interesting column indices in the current row (the columns that match the current word from a).

• Finally, and partly by accident, all of the preceding optimizations conspire in such a way that the current row's index never needs to be known, so we don't have to bother computing it either.

EDIT - the dirt simple version

Here is code that implements the 2D matrix directly, with no attempt at optimization (other than that a Counter usually avoids storing 0 entries explicitly). It is extremely simple, short, and easy:

    def match(a, b):
        from collections import Counter
        M = Counter()
        for i in range(len(a)):
            for j in range(len(b)):
                if a[i] == b[j]:
                    M[i, j] = M[i-1, j-1] + 1
        unique = set()
        for i in range(len(a)):
            for j in range(len(b)):
                if M[i, j] and not M[i+1, j+1]:
                    length = M[i, j]
                    unique.add(" ".join(a[i+1-length: i+1]))
        return unique

Of course ;-) it returns the same results as the optimized match() I posted at first.

EDIT - and another one without dicts

Just for fun :-) if you have the matrix model down, this code is easy to follow. A remarkable thing about this particular problem is that a matrix cell's value depends only on the values along the diagonal to the cell's northwest. So it is "good enough" just to traverse all of the main diagonals, proceeding southeast from every cell on the west and north borders. That way only a small constant amount of space is needed, regardless of the lengths of the inputs.
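
A minimal sketch of such a diagonal traversal (my own reconstruction of the idea just described, under a hypothetical name match_diagonals, not the original code) could look like this; it should return the same set as the versions above:

    def match_diagonals(a, b):
        # Walk every northwest-to-southeast diagonal of the (implicit)
        # len(a) x len(b) matrix, keeping only the running streak length k,
        # so only constant extra space is needed.
        m, n = len(a), len(b)
        unique = set()
        starts = [(i, 0) for i in range(m)] + [(0, j) for j in range(1, n)]
        for i, j in starts:
            k = 0
            while i < m and j < n:
                if a[i] == b[j]:
                    k += 1
                else:
                    if k:
                        # The streak ended at a[i-1], with length k.
                        unique.add(" ".join(a[i - k: i]))
                    k = 0
                i += 1
                j += 1
            if k:
                unique.add(" ".join(a[i - k: i]))
        return unique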
      

Answer 1 (score: 4)

There are really four questions embedded in your post.

1) How do you split text into words?

There are many ways to do this, depending on what you count as a word, whether you care about case, whether contractions are allowed, etc. A regular expression lets you implement whichever word-splitting rule you choose. The one I usually use is r"[a-z'\-]+". It captures contractions like don't and allows hyphenated words like mother-in-law.
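
For example, a tiny demo of that pattern (not part of the original answer):

import re

# The apostrophe- and hyphen-friendly pattern mentioned above.
print(re.findall(r"[a-z'\-]+", "don't tell my mother-in-law"))
# -> ["don't", 'tell', 'my', 'mother-in-law']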

2) What data structure can speed up the search for common subsequences?

Create a location map showing where each word occurs. For example, in the sentence you should do what you like, the mapping for you is {"you": [0, 4]} because it appears twice, once at position zero and once at position four.

With a location map in hand, it is a simple matter to loop over the starting points and compare n-length subsequences.

3) How do you find common n-length subsequences?

Loop over all of the words in one of the sentences. For each such word, find the places where it occurs in the other sequence (using the location map) and test whether the two n-length slices are equal.

4) How do you find the longest common subsequence?

The max() function finds the maximum value. It takes a key function such as len() to determine the basis for the comparison.

Here is some working code that you can customize to solve your own problem:

import re

def to_words(text):
    'Break text into a list of lowercase words without punctuation'
    return re.findall(r"[a-z']+", text.lower())

def starting_points(wordlist):
    'Map each word to a list of indices where the word appears'
    d = {}
    for i, word in enumerate(wordlist):
        d.setdefault(word, []).append(i)
    return d

def sequences_in_common(wordlist1, wordlist2, n=1):
    'Generate all n-length word groups shared by two word lists'
    starts = starting_points(wordlist2)
    for i, word in enumerate(wordlist1):
        seq1 = wordlist1[i: i+n]
        for j in starts.get(word, []):
            seq2 = wordlist2[j: j+n]
            if seq1 == seq2 and len(seq1) == n:
                yield ' '.join(seq1)

if __name__ == '__main__':

    t1 = "They all are white a sheet of spotless paper when they first are " \
         "born but they are to be scrawled upon and blotted by every goose quill"

    t2 = "You are all white, a sheet of lovely, spotless paper, when you first " \
         "are born; but you are to be scrawled and blotted by every goose's quill"

    w1 = to_words(t1)
    w2 = to_words(t2)

    for n in range(1,10):
        matches = list(sequences_in_common(w1, w2, n))
        if matches:
            print(n, '-->', max(matches, key=len))

Answer 2 (score: 2)

For cases like this, the difflib module is a good candidate; see get_matching_blocks:

import difflib

def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

first_string = "this is a sample string"
second_string = "this is also a sample string"
print matches(second_string, first_string )

Demo: http://ideone.com/Ca3h8Z

Answer 3 (score: 0)

With a slight modification so that it matches words rather than characters, I think this will do it:

import difflib

def matche_words(first_string, second_string):
    l1 = first_string.split()
    l2 = second_string.split()
    s = difflib.SequenceMatcher(None, l1, l2)
    match = [l1[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

Demo:

>>> print '\n'.join(map(' '.join, matche_words(a,b)))
all
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every
quill