
Time: 2013-11-28 13:32:36

Tags: python string algorithm pattern-matching substring






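I want the script to recognize that these strings share a common 2-word sequence ("this is") followed by a common 3-word sequence ("a sample string"). Here is my current approach: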

a = "this is a sample string"
b = "this is also a sample string"

aWords = a.split()
bWords = b.split()

#create counters to keep track of position in string
currentA = 0
currentB = 0

#create counter to keep track of longest sequence of matching words
matchStreak = 0

#create a list that contains all of the matchstreaks found
matchStreakList = []

#create binary switch to control the use of while loop
continueWhileLoop = 1

for word in aWords:
    currentA += 1

    if word == bWords[currentB]:
        matchStreak += 1

        #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
        if currentB + 1 < len(bWords):
            currentB += 1

        #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
        if currentA == len(aWords):
            matchStreakList.append(matchStreak)

    elif word != bWords[currentB]:

        #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to out list of streaks and then reset the counter
        if matchStreak >= 1:
            matchStreakList.append(matchStreak)
        matchStreak = 0

        while word != bWords[currentB]:

            #the two words don't match. If you can move b forward one word, do so, then check for another match
            if currentB + 1 < len(bWords):
                currentB += 1

            #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
            elif currentB + 1 == len(bWords):
                currentB = 0

        if word == bWords[currentB]:
            matchStreak += 1

            #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
            if currentB + 1 < len(bWords):
                currentB += 1
            elif currentB + 1 == len(bWords):

                #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
                if currentA == len(aWords):
                    if matchStreak >= 1:
                        matchStreakList.append(matchStreak)
                currentB = 0

print matchStreakList


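[This question differs from the longest common substring problem, which is only a special case of what I'm looking for (I want to find all common substrings, not just the longest one). This SO post suggests that methods such as 1) cluster analysis, 2) edit distance routines, and 3) longest common sequence algorithms might be suitable approaches, but I have not found a working solution. My problem is perhaps a bit easier than the one mentioned in the link, because I'm dealing with words delimited by whitespace.]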




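"They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"


"You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"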


white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every


#import required packages
import difflib

#define function we'll use to identify matches
def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()

print matches(a,b)


['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']


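4 answers:

Answer 0 (score: 5)

There are still ambiguities here, and I don't want to spend time arguing about them. But I think I can add something helpful anyway ;-)

I wrote Python's difflib.SequenceMatcher, and spent a lot of time finding expected-case fast ways to find longest common substrings. In theory, that should be done with "suffix trees", or the related "suffix arrays" augmented with "longest common prefix arrays" (the phrases in quotes are search terms if you want to Google for more). Those can solve the problem in worst-case linear time. But, as is sometimes the case, the worst-case linear-time algorithms are excruciatingly complex and delicate, and suffer large constant factors - they can still pay off hugely if a given corpus is going to be searched many times, but that's not the typical case for Python's difflib, and it doesn't look like your case either.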
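For reference, a minimal sketch of what difflib's find_longest_match() reports when it is given word lists rather than characters. The inputs are the two short strings from the question; the variable names are illustrative only.

```python
import difflib

a = "this is a sample string".split()
b = "this is also a sample string".split()

# SequenceMatcher works on any sequences of hashable items, so lists of
# words can be compared directly.
s = difflib.SequenceMatcher(None, a, b)

# Explicit bounds are passed because default arguments for this method
# only exist in newer Python versions.
m = s.find_longest_match(0, len(a), 0, len(b))
longest = a[m.a:m.a + m.size]
print(longest)
```

Only the single longest block comes back from this call, so the shorter shared run "this is" is not reported.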


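Here's a generalization of difflib's find_longest_match() that returns all the (locally) maximal matches it finds along the way, instead of just the longest. Notes:

  1. I'll use the to_words() function Raymond Hettinger gave you, but without the conversion to lowercase. Converting to lowercase leads to output that isn't exactly what you said you wanted.

  2. Nevertheless, as I already noted in a comment, this does output "quill", which isn't in your list of desired outputs. I have no idea why not, since "quill" does appear in both inputs.

  3. Here's the code: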



    import re
    def to_words(text):
        'Break text into a list of words without punctuation'
        return re.findall(r"[a-zA-Z']+", text)
    def match(a, b):
        # Make b the longer list.
        if len(a) > len(b):
            a, b = b, a
        # Map each word of b to a list of indices it occupies.
        b2j = {}
        for j, word in enumerate(b):
            b2j.setdefault(word, []).append(j)
        j2len = {}
        nothing = []
        unique = set() # set of all results
        def local_max_at_j(j):
            # maximum match ends with b[j], with length j2len[j]
            length = j2len[j]
            unique.add(" ".join(b[j-length+1: j+1]))
        # during an iteration of the loop, j2len[j] = length of longest
        # match ending with b[j] and the previous word in a
        for word in a:
            # look at all instances of word in b
            j2lenget = j2len.get
            newj2len = {}
            for j in b2j.get(word, nothing):
                newj2len[j] = j2lenget(j-1, 0) + 1
            # which indices have not been extended?  those are
            # (local) maximums
            for j in j2len:
                if j+1 not in newj2len:
                    local_max_at_j(j)
            j2len = newj2len
        # and we may also have local maximums ending at the last word
        for j in j2len:
            local_max_at_j(j)
        return unique


    a = "They all are white a sheet of spotless paper " \
        "when they first are born but they are to be " \
        "scrawled upon and blotted by every goose quill"
    b = "You are all white, a sheet of lovely, spotless " \
        "paper, when you first are born; but you are to " \
        "be scrawled and blotted by every goose's quill"
    print match(to_words(a), to_words(b))

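That prints:

    set(['all', 'and blotted by every', 'first are born but', 'are to be scrawled', 'are', 'spotless paper when', 'white a sheet of', 'quill'])

EDIT - how it works

For input sequences a and b, picture a matrix M with len(a) rows and len(b) columns. In this application, we want M[i, j] to contain the length of the longest common contiguous subsequence ending with a[i] and b[j], and the computational rules are very easy:

1. M[i, j] = 0 if a[i] != b[j].
2. M[i, j] = M[i-1, j-1] + 1 if a[i] == b[j] (where we assume an out-of-bounds matrix reference silently returns 0).

Interpretation is also very easy in this case: there is a locally maximal non-empty match ending at a[i] and b[j], of length M[i, j], if and only if M[i, j] is non-zero but M[i+1, j+1] is either 0 or out of bounds.

You can use those rules to write very simple and compact code, with two loops, that computes M correctly for this problem. The downside is that the code will take (best, average and worst case) O(len(a) * len(b)) time and space.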
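To make the matrix rules concrete, here is a small sketch that builds the full matrix as a list of lists for the two short strings from the question. The helper names common_run_matrix and local_maxima are hypothetical, chosen for illustration, and the dense storage is an assumption made for readability.

```python
def common_run_matrix(a, b):
    """Return M where M[i][j] is the length of the common run ending at a[i], b[j]."""
    M = [[0] * len(b) for _ in a]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            if wa == wb:
                # Out-of-bounds neighbours count as 0.
                prev = M[i - 1][j - 1] if i > 0 and j > 0 else 0
                M[i][j] = prev + 1
    return M


def local_maxima(a, b, M):
    """Yield runs whose cell is non-zero and whose southeast neighbour is 0 or missing."""
    for i in range(len(a)):
        for j in range(len(b)):
            if not M[i][j]:
                continue
            has_next = i + 1 < len(a) and j + 1 < len(b)
            if not (has_next and M[i + 1][j + 1]):
                yield " ".join(a[i + 1 - M[i][j]: i + 1])


a = "this is a sample string".split()
b = "this is also a sample string".split()
M = common_run_matrix(a, b)
for row in M:
    print(row)
runs = sorted(local_maxima(a, b, M))
print(runs)
```

The non-zero cells lie on two diagonals, one ending in a 2 ("this is") and one ending in a 3 ("a sample string"), which are exactly the two sequences the question asks for.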


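The code posted above does exactly this, with several optimizations for expected cases:

• Instead of doing one pass to compute M and then another pass to interpret the results, computation and interpretation are interleaved in a single pass over a.

• Because of that, the whole matrix doesn't need to be stored. Only the current row (newj2len) and the previous row (j2len) are present at the same time.

• Because the matrix in this problem is usually mostly zeros, a row is represented sparsely, via a dict mapping column indices to non-zero values. Zero entries are "free" in that they're never stored explicitly.

• When processing a row, there's no need to iterate over every column: the precomputed b2j dict tells us exactly the interesting column indices in the current row (those columns that match the current word from a).

• Finally, and partly by accident, all the preceding optimizations conspire in such a way that there's never a need to know the current row's index, so we don't bother computing that either.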

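EDIT - the dirt simple version

The matrix rules can be implemented directly with two loops. A Counter returns 0 for missing keys, so out-of-bounds references need no special handling:

    def match(a, b):
        from collections import Counter
        M = Counter()
        for i in range(len(a)):
            for j in range(len(b)):
                if a[i] == b[j]:
                    M[i, j] = M[i-1, j-1] + 1
        unique = set()
        for i in range(len(a)):
            for j in range(len(b)):
                if M[i, j] and not M[i+1, j+1]:
                    length = M[i, j]
                    unique.add(" ".join(a[i+1-length: i+1]))
        return unique

EDIT - another one without a dict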
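A dict-free variant could look like the sketch below. It relies on the observation that M[i, j] depends only on its northwest neighbour M[i-1, j-1], so each diagonal can be walked independently with a single running counter. This is a reconstruction under that assumption, not the answer's own code, and the name match_diagonals is hypothetical.

```python
def match_diagonals(a, b):
    """Collect locally maximal common runs by walking each diagonal of M."""
    unique = set()

    def walk(i, j):
        # Follow one diagonal southeast from (i, j), keeping only a running
        # streak length instead of storing matrix cells.
        streak = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                streak += 1
            else:
                if streak:
                    unique.add(" ".join(a[i - streak:i]))
                streak = 0
            i += 1
            j += 1
        if streak:  # the run reaches the border of the matrix
            unique.add(" ".join(a[i - streak:i]))

    for i in range(len(a)):       # diagonals starting on the west border
        walk(i, 0)
    for j in range(1, len(b)):    # diagonals starting on the north border
        walk(0, j)
    return unique


result = match_diagonals("this is a sample string".split(),
                         "this is also a sample string".split())
print(sorted(result))
```

Only a constant amount of extra state is kept per diagonal, at the cost of still visiting every cell of the matrix once.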

Answer 1 (score: 4)





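Create a map showing the positions where each word appears. For example, in the sentence you should do what you like, the mapping for you is {"you": [0, 4]} because it appears twice, once at position zero and once at position four.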
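The mapping described above can be checked with a short sketch. The helper name word_positions is hypothetical, and defaultdict is used here only as one convenient way to build the lists.

```python
from collections import defaultdict


def word_positions(words):
    """Map each word to the list of indices where it occurs."""
    positions = defaultdict(list)
    for i, w in enumerate(words):
        positions[w].append(i)
    return dict(positions)


positions = word_positions("you should do what you like".split())
print(positions["you"])
```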





The max() function finds the maximum value. It takes a key function such as len() to determine the basis for comparison.
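A small illustration of how the key function changes what max() compares. The phrases list is made-up sample data. With key=len the comparison is by character count, while a word-count key compares by number of words; on ties max() returns the first maximal item.

```python
phrases = ["white a sheet of", "all", "are to be scrawled"]

# Compare by number of characters.
longest_by_chars = max(phrases, key=len)

# Compare by number of words; the two 4-word phrases tie, so the first wins.
longest_by_words = max(phrases, key=lambda p: len(p.split()))

print(longest_by_chars)
print(longest_by_words)
```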


import re

def to_words(text):
    'Break text into a list of lowercase words without punctuation'
    return re.findall(r"[a-z']+", text.lower())

def starting_points(wordlist):
    'Map each word to a list of indices where the word appears'
    d = {}
    for i, word in enumerate(wordlist):
        d.setdefault(word, []).append(i)
    return d

def sequences_in_common(wordlist1, wordlist2, n=1):
    'Generate all n-length word groups shared by two word lists'
    starts = starting_points(wordlist2)
    for i, word in enumerate(wordlist1):
        seq1 = wordlist1[i: i+n]
        for j in starts.get(word, []):
            seq2 = wordlist2[j: j+n]
            if seq1 == seq2 and len(seq1) == n:
                yield ' '.join(seq1)

if __name__ == '__main__':

    t1 = "They all are white a sheet of spotless paper when they first are " \
         "born but they are to be scrawled upon and blotted by every goose quill"

    t2 = "You are all white, a sheet of lovely, spotless paper, when you first " \
         "are born; but you are to be scrawled and blotted by every goose's quill"

    w1 = to_words(t1)
    w2 = to_words(t2)

    for n in range(1,10):
        matches = list(sequences_in_common(w1, w2, n))
        if matches:
            print(n, '-->', max(matches, key=len))

Answer 2 (score: 2)



import difflib

def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

first_string = "this is a sample string"
second_string = "this is also a sample string"
print matches(second_string, first_string )


Answer 3 (score: 0)


import difflib

def matche_words(first_string,second_string):
    l1 = first_string.split()
    l2 = second_string.split()
    s = difflib.SequenceMatcher(None, l1, l2)
    match = [l1[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match


>>> print '\n'.join(map(' '.join, matche_words(a,b)))
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every