How do I find the best-fit subsequences of a large string?

Time: 2017-08-31 21:15:01

Tags: python algorithm levenshtein-distance fuzzy-comparison lcs

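Say I have one large string and an array of substrings that, when joined, equal the large string (with small differences).

For example (note the subtle differences between the strings):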

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"

sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
 "subsrings tat aproimately ", "match the orginal strng"]

How can I best align the strings to produce a new set of substrings from the original large_str? For example:

["hello, this is a long string", ", that may be made up of multiple",
 "substrings that approximately ", "match the original string"]

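Additional information

The use case is to find the page breaks of the original text from the existing page breaks of text extracted from a PDF document. The text extracted from the PDF was produced by OCR and contains small errors compared to the original text, but the original text has no page breaks. The goal is to paginate the original text accurately while avoiding the OCR errors of the PDF text.

3 Answers:

Answer 0: (score: 3)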

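  1. Concatenate the substrings.
  2. Align the concatenation with the original string.
  3. Keep track of which positions in the original string align with the boundaries between the substrings.
  4. Split the original string at the positions aligned with those boundaries.

An implementation using Python's difflib: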

    from difflib import SequenceMatcher
    from itertools import accumulate

    large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"

    sub_strs = [
      "hello, ths is a lng strin",
      ", that ay be mad up of multiple",
      "subsrings tat aproimately ",
      "match the orginal strng"]

    # End offset of each substring within the concatenation of sub_strs.
    sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))

    # Align large_str (sequence a) against the concatenated substrings (sequence b).
    sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk=False)

    match_index = 0                  # index of the substring currently being rebuilt
    matches = [''] * len(sub_strs)   # pieces of large_str, one per substring

    for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
      if tag == 'delete' or tag == 'insert' or tag == 'replace':
        # Differing span: give all of large_str[i1:i2] to the current substring,
        # then advance j1 past any substring boundaries inside [j1, j2).
        matches[match_index] += large_str[i1:i2]
        while j1 < j2:
          submatch_len = min(sub_str_boundaries[match_index], j2) - j1
          while submatch_len == 0:
            match_index += 1
            submatch_len = min(sub_str_boundaries[match_index], j2) - j1
          j1 += submatch_len
      else:
        # 'equal' span: copy it piece by piece, switching to the next
        # substring whenever a boundary is crossed.
        while j1 < j2:
          submatch_len = min(sub_str_boundaries[match_index], j2) - j1
          while submatch_len == 0:
            match_index += 1
            submatch_len = min(sub_str_boundaries[match_index], j2) - j1
          matches[match_index] += large_str[i1:i1+submatch_len]
          j1 += submatch_len
          i1 += submatch_len

    print(matches)
    

Output:

    ['hello, this is a long string', 
     ', that may be made up of multiple ', 
     'substrings that approximately ', 
     'match the original string']
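Steps 3 and 4 can also be written as an explicit mapping from boundary offsets in the concatenation to offsets in the original string. The following is only a sketch under some assumptions: the function name `split_like` is made up for illustration, the inputs are assumed to be non-empty, and a boundary that falls inside a differing span is simply moved to the end of that span, so ties may be resolved differently than in the code above.

```python
from difflib import SequenceMatcher
from itertools import accumulate


def split_like(original, fragments):
    """Split `original` where the boundaries between `fragments` align with it.

    Hypothetical helper for illustration; assumes non-empty inputs.
    """
    joined = "".join(fragments)
    # Offsets in `joined` where one fragment ends and the next begins.
    cuts = list(accumulate(len(f) for f in fragments))[:-1]
    opcodes = SequenceMatcher(None, original, joined, autojunk=False).get_opcodes()

    positions = [0]
    k = 0
    for cut in cuts:
        # Advance to the first opcode whose range in `joined` reaches this cut.
        while opcodes[k][4] < cut:
            k += 1
        tag, i1, i2, j1, j2 = opcodes[k]
        if tag == "equal":
            # Inside a matching span the offsets move in lockstep.
            positions.append(i1 + (cut - j1))
        else:
            # Inside a differing span: place the boundary at the end of the span.
            positions.append(i2)
    positions.append(len(original))
    return [original[a:b] for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    print(split_like("abcdef", ["abc", "def"]))
```

Because the pieces are slices between consecutive positions, joining them always reproduces the original string, whatever alignment `SequenceMatcher` chooses.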
    

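Answer 1: (score: 2)

You are trying to solve a sequence alignment problem. In your case it is a "local" sequence alignment, which can be solved with the Smith-Waterman method. One possible implementation is here. If you run it, you get: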

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng sin", ", that ay be md up of mulple", "susrings tat aproimately ", "manbvch the orhjgnal strng"]

for sbs in sub_strs:
    water(large_str, sbs)


 >>>

Identity = 85.185 percent
Score = 210
hello, this is a long strin
hello, th s is a l ng s  in
hello, th-s is a l-ng s--in

Identity = 84.848 percent
Score = 255
, that may be made up of multiple
, that  ay be m d  up of mul  ple
, that -ay be m-d- up of mul--ple

Identity = 83.333 percent
Score = 225
substrings that approximately 
su s rings t at a pro imately 
su-s-rings t-at a-pro-imately 

Identity = 75.000 percent
Score = 175
ma--tch the or-iginal string
ma   ch the or  g nal str ng
manbvch the orhjg-nal str-ng

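The middle line shows the matching characters. If you need the positions, look at the max_i value to get the end position in the original string. The start position is the value of i at the end of the water() function.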
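For readers without access to the linked implementation, here is a minimal Smith-Waterman sketch that returns only the score and the start and end positions in the original string. It is not the `water()` function from the link: the name `local_align` and the scoring values (match 2, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def local_align(text, pattern, match=2, mismatch=-1, gap=-1):
    """Return (score, start, end) so that text[start:end] is the best local match.

    Illustrative Smith-Waterman sketch; scoring values are assumptions.
    """
    rows, cols = len(text) + 1, len(pattern) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if text[i - 1] == pattern[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + step,  # align text[i-1] with pattern[j-1]
                score[i - 1][j] + gap,       # text character with no counterpart
                score[i][j - 1] + gap,       # pattern character with no counterpart
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    end = i
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if text[i - 1] == pattern[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, end


if __name__ == "__main__":
    text = "say hello, world"
    score, start, end = local_align(text, "helo")
    print(score, text[start:end])
```

Running this once per substring gives a candidate span of the original string for each one. Because each alignment is independent, neighbouring spans may overlap or leave gaps, which would have to be reconciled separately.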

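Answer 2: (score: 1)

(The additional information makes much of the following unnecessary. This was written for the case where the supplied substrings might be any permutation of the order in which they occur in the main string.)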

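There is a dynamic programming solution for a problem very close to this one. In the dynamic programming algorithm that gives you the edit distance, the state of the dynamic program is (a, b), where a is an offset into the first string and b is an offset into the second string. For each pair (a, b) you work out the smallest possible edit distance that matches the first a characters of the first string with the first b characters of the second string, computing (a, b) from (a-1, b-1), (a-1, b) and (a, b-1).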
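The recurrence over (a, b) described above is the standard Levenshtein edit distance. A minimal sketch, with the function name `edit_distance` chosen here for illustration and unit cost assumed for every insertion, deletion and substitution:

```python
def edit_distance(first, second):
    """Levenshtein distance between `first` and `second` with unit costs."""
    rows, cols = len(first) + 1, len(second) + 1
    # dist[a][b] = edit distance between first[:a] and second[:b].
    dist = [[0] * cols for _ in range(rows)]
    for a in range(rows):
        dist[a][0] = a  # delete all a characters
    for b in range(cols):
        dist[0][b] = b  # insert all b characters

    for a in range(1, rows):
        for b in range(1, cols):
            substitution = dist[a - 1][b - 1] + (first[a - 1] != second[b - 1])
            deletion = dist[a - 1][b] + 1
            insertion = dist[a][b - 1] + 1
            dist[a][b] = min(substitution, deletion, insertion)
    return dist[-1][-1]


if __name__ == "__main__":
    print(edit_distance("match the orginal strng", "match the original string"))
```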

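You can now write a similar algorithm with states (a, n, m, b), where a is the total number of characters consumed by substrings so far, n is the index of the current substring, m is the position within the current substring, and b is the number of characters matched in the second string. This solves the problem of matching b against a string made up by pasting together any number of copies of any of the available substrings.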
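A simplified sketch of this idea follows. It is not the exact (a, n, m, b) formulation: it drops a from the state and instead records, for every prefix of the long string, the cheapest way to cover it with whole substrings, reusing the ordinary edit-distance recurrence for each substring. The names `edit_distance_row` and `best_cover_distance` are made up for illustration, unit edit costs are assumed, and the nested loops are written for clarity rather than speed.

```python
def edit_distance_row(fragment, text):
    """Return d where d[k] is the edit distance between fragment and text[:k]."""
    prev = list(range(len(text) + 1))  # empty fragment vs text[:k] costs k
    for i, fc in enumerate(fragment, 1):
        curr = [i]
        for k, tc in enumerate(text, 1):
            curr.append(min(prev[k] + 1,                # drop a fragment character
                            curr[k - 1] + 1,            # drop a text character
                            prev[k - 1] + (fc != tc)))  # match or substitute
        prev = curr
    return prev


def best_cover_distance(target, fragments):
    """Smallest edit distance between `target` and any concatenation of
    `fragments`, where each fragment may be used any number of times."""
    n = len(target)
    # best[b] = cheapest cost of matching target[:b]; start from the empty
    # concatenation, which costs b deletions.
    best = list(range(n + 1))
    for start in range(n):
        for frag in fragments:
            row = edit_distance_row(frag, target[start:])
            # Let `frag` cover target[start:start + k] for every k >= 1.
            for k in range(1, len(row)):
                cand = best[start] + row[k]
                if cand < best[start + k]:
                    best[start + k] = cand
    return best[n]


if __name__ == "__main__":
    print(best_cover_distance("abab", ["ab", "zz"]))  # "ab" is used twice
```

Since fragments may be reused or left out, the result can never be larger than the edit distance of the best true permutation of the fragments.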

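This is a different problem, because if you are trying to reconstruct a long string from fragments, you might get a solution that uses the same fragment more than once. If you do this, though, you might hope that the answer is clear-cut enough that the collection of substrings it produces happens to be a permutation of the collection given to it.

Because the edit distance returned by this method is always at least as good as the best edit distance when you force a permutation, you can also use it to compute a lower bound on the best possible edit distance for a permutation, and run a branch-and-bound algorithm to find the best permutation.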