Identifying similar strings in Python

Time: 2015-07-06 21:05:12

Tags: python

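I have generated an edited DNA sequencing file with individual reads on separate lines, and I want to eliminate any read that matches an earlier kept line to within one character.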

Input file:

AAAAAAAAAAAA    #Start checking at line 1
TTTTTTTTTTTT    #Diff by >1 char: Keep
AAAAACAAAAAA    #Diff by 1 char: Delete
AAAAACAAACAA    #Diff by 2 char: Keep
AAAAAAAAAAAA    #Diff by <1 char: Delete

Output file:

AAAAAAAAAAAA
TTTTTTTTTTTT
AAAAACAAACAA

What I have so far:

with open(current_file, 'r') as f:
    lineCharsList = []
    outLines = []
    for line in f:
        lineChars = line[:]

        if not (lineChars in lineCharsList):    #exactly matches lines, need partial matching
            lineCharsList.append(lineChars)
            outLines.append(line)
            print line

2 answers:

Answer 0: (score: 2)

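Run pip install python-levenshtein and use the function Levenshtein.hamming to compare the strings.

hamming(string1, string2) computes the Hamming distance between two strings.

The Hamming distance is simply the number of differing characters. This means the strings must have the same length.

Examples: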

>>> hamming('Hello world!', 'Holly grail!')
7
>>> hamming('Brian', 'Jesus')
5
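
For readers who would rather not install the library, the same count can be sketched in plain Python. The name hamming_distance is made up for this illustration and is not part of python-levenshtein; raising ValueError on unequal lengths is an assumption about how one might want to handle that case.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Assumption: unequal lengths are treated as an error here.
        raise ValueError("strings must have the same length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(hamming_distance("AAAAAAAAAAAA", "AAAAACAAAAAA"))  # 1
print(hamming_distance("AAAAAAAAAAAA", "AAAAACAAACAA"))  # 2
```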

The code is:

import Levenshtein

input_lines = [
    "AAAAAAAAAAAA",
    "TTTTTTTTTTTT",    # Diff by >1 char: Keep
    "AAAAACAAAAAA",    # Diff by 1 char: Delete
    "AAAAACAAACAA",    # Diff by 2 char: Keep
    "AAAAAAAAAAAA",    # Diff by <1 char: Delete
    ]
output_lines = []

for current_line in input_lines:
    for previous_line in output_lines:
        if Levenshtein.hamming(previous_line, current_line) < 2:
            break
    else:
        output_lines.append(current_line)

print('\n'.join(output_lines))

Output:

AAAAAAAAAAAA
TTTTTTTTTTTT
AAAAACAAACAA
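
Since the question reads the reads from a file, here is a rough sketch of the same filter applied file to file, without the library. The helper names within_one and filter_reads, the file names, and the choice to treat reads of different lengths as non-matching are all assumptions made for this example. The comparison stops as soon as a second mismatch is found, which saves some work on long reads.

```python
import os
import tempfile


def within_one(a, b):
    """Return True if a and b have equal length and differ in at most one position."""
    if len(a) != len(b):
        return False  # assumption: reads of different length never count as near-duplicates
    mismatches = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            mismatches += 1
            if mismatches > 1:
                return False  # stop early; no need to scan the rest
    return True


def filter_reads(in_path, out_path):
    """Copy reads from in_path to out_path, skipping near-duplicates of kept reads."""
    kept = []
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            read = line.strip()
            if not read:
                continue  # skip blank lines
            if not any(within_one(read, previous) for previous in kept):
                kept.append(read)
                dst.write(read + "\n")
    return kept


# Usage with a throwaway file holding the reads from the question.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "reads.txt")
out_path = os.path.join(workdir, "filtered.txt")
with open(in_path, "w") as f:
    f.write("AAAAAAAAAAAA\nTTTTTTTTTTTT\nAAAAACAAAAAA\nAAAAACAAACAA\nAAAAAAAAAAAA\n")

print(filter_reads(in_path, out_path))
```

Each new read is compared against every kept read, so the cost grows roughly with the square of the number of distinct reads; for very large files a different strategy may be needed.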

Answer 1: (score: 1)

You have already received a good answer.

Here is my implementation in plain Python:

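with open(current_file, 'r') as f:
    outlines = []
    for line in f:
        # Count mismatched positions against each kept line separately;
        # checking all kept lines together per position can hide real differences.
        distances = [sum(a != b for a, b in zip(line, kept)) for kept in outlines]
        # Keep the line only if it differs from every kept line by more than one character.
        if all(d > 1 for d in distances):
            outlines.append(line)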
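
A small illustration, with reads made up for the example, of why the mismatch count is best taken against each kept line on its own rather than against all kept lines at once. The variable names are hypothetical.

```python
kept = ["AAAA", "TTTT"]
candidate = "ATTA"

# Per-line counting: one mismatch count for each kept line.
per_line = [sum(a != b for a, b in zip(candidate, line)) for line in kept]
print(per_line)  # [2, 2]: more than one difference from every kept line, so keep it

# Pooled counting: a position counts as a match if ANY kept line
# has the same character there.
columns = zip(candidate, *kept)
pooled = sum(col[0] not in col[1:] for col in columns)
print(pooled)  # 0: the candidate would wrongly look like a duplicate
```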