Finding the best-matching sequences

Time: 2014-10-16 00:08:30

Tags: python hamming-distance

I have 2 sequence files. Say ham1.txt:

AAACCCTTTGGG
AGGTACTTTTTT
TCTCTTTTTTTT

and so on.

ham2.txt:

AAACCCTTTGGG
GAGAGGGAGGGC
AGGTACTTTTTT
CTCTTAATTTCC
TCTCTTTTTTTT
GTTTTTAAAAAA

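I want to match each sequence in ham1.txt with a sequence in ham2.txt, choosing the pair that has the smallest Hamming distance. My Python code prints the Hamming distance between every pair, but I only want the best pairings. Here is my code: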

def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

with open('ham1.txt', 'r') as file1:
    for s1 in file1:
        s1 = s1.strip()  # drop the trailing newline before comparing
        with open('ham2.txt', 'r') as file2:
            for s2 in file2:
                s2 = s2.strip()
                dist = hamming_distance(s1, s2)
                print(s1, s2, dist)

Can you suggest an edit? Thanks.

3 answers:

Answer 0 (score: 1)

You should take a look at itertools.product.

In [7]:

L1 = ['AAACCCTTTGGG',
      'AGGTACTTTTTT',
      'TCTCTTTTTTTT']
L2 = ['AAACCCTTTGGG',
      'GAGAGGGAGGGC',
      'AGGTACTTTTTT',
      'CTCTTAATTTCC',
      'TCTCTTTTTTTT',
      'GTTTTTAAAAAA']
import itertools

def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

# One [distance, s1, s2] entry for every pair drawn from L1 x L2
res = [[hamming_distance(*item), item[0], item[1]] for item in itertools.product(L1, L2)]
sorted(res)[0]
Out[7]:
[0, 'AAACCCTTTGGG', 'AAACCCTTTGGG']
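
Note that sorted(res)[0] returns only the single closest pair overall. If the goal is one best partner for each sequence in L1, a minimal sketch could look like the following. The helper name best_matches is hypothetical and not part of the original answer; ties are broken by the alphabetical order of the candidate, because tuples compare element by element.

```python
def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


def best_matches(queries, targets):
    # Hypothetical helper: map each query to (distance, closest target).
    # min() compares the tuples by distance first, then by target string.
    return {
        s1: min((hamming_distance(s1, s2), s2) for s2 in targets)
        for s1 in queries
    }


L1 = ['AAACCCTTTGGG', 'AGGTACTTTTTT', 'TCTCTTTTTTTT']
L2 = ['AAACCCTTTGGG', 'GAGAGGGAGGGC', 'AGGTACTTTTTT',
      'CTCTTAATTTCC', 'TCTCTTTTTTTT', 'GTTTTTAAAAAA']

for s1, (dist, s2) in best_matches(L1, L2).items():
    print(s1, s2, dist)
```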

Answer 1 (score: 0)

I generated the following list:

0 AAACCCTTTGGG AAACCCTTTGGG
0 AGGTACTTTTTT AGGTACTTTTTT
0 TCTCTTTTTTTT TCTCTTTTTTTT
6 AGGTACTTTTTT TCTCTTTTTTTT
6 TCTCTTTTTTTT AGGTACTTTTTT
7 AAACCCTTTGGG AGGTACTTTTTT
7 AGGTACTTTTTT AAACCCTTTGGG
8 AAACCCTTTGGG TCTCTTTTTTTT
8 AGGTACTTTTTT CTCTTAATTTCC
8 TCTCTTTTTTTT AAACCCTTTGGG
8 TCTCTTTTTTTT CTCTTAATTTCC
9 AAACCCTTTGGG GAGAGGGAGGGC
9 TCTCTTTTTTTT GTTTTTAAAAAA
10 AAACCCTTTGGG CTCTTAATTTCC
11 AGGTACTTTTTT GAGAGGGAGGGC
11 AGGTACTTTTTT GTTTTTAAAAAA
12 AAACCCTTTGGG GTTTTTAAAAAA
12 TCTCTTTTTTTT GAGAGGGAGGGC

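I think this is what you need, right?

To achieve this, I used a few libraries. First I convert the data streams/strings into lists of values, then I take every possible combination of ham1 and ham2 and build a new list that includes the Hamming distance of each pair, and finally I sort that list.

Does this help you? If not, just ask and I will help you out ;)

The code I used is below.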

# distance is a third-party package, not part of the standard library
from distance import hamming
from itertools import product

ham1="""
AAACCCTTTGGG
AGGTACTTTTTT
TCTCTTTTTTTT
"""

ham2="""
AAACCCTTTGGG
GAGAGGGAGGGC
AGGTACTTTTTT
CTCTTAATTTCC
TCTCTTTTTTTT
GTTTTTAAAAAA
"""

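# Split each block into lines and drop the empty ones
ham1data = [line for line in ham1.splitlines() if line]
ham2data = [line for line in ham2.splitlines() if line]

# (distance, h1, h2) for every combination of ham1 and ham2
res = [(hamming(h1, h2), h1, h2) for h1, h2 in product(ham1data, ham2data)]

for v, h1, h2 in sorted(res):
    print(v, h1, h2)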
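
The sorted list still contains every pair. To reduce it to the best pair per ham1 sequence, one possible sketch is to walk the sorted list and keep only the first entry seen for each h1. To stay self-contained, the block below uses a plain-Python stand-in for distance.hamming; that stand-in and the name best are assumptions for illustration only.

```python
from itertools import product


def hamming(s1, s2):
    # Plain-Python stand-in for distance.hamming (illustrative assumption)
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


ham1data = ['AAACCCTTTGGG', 'AGGTACTTTTTT', 'TCTCTTTTTTTT']
ham2data = ['AAACCCTTTGGG', 'GAGAGGGAGGGC', 'AGGTACTTTTTT',
            'CTCTTAATTTCC', 'TCTCTTTTTTTT', 'GTTTTTAAAAAA']

res = sorted((hamming(h1, h2), h1, h2) for h1, h2 in product(ham1data, ham2data))

# The list is sorted by distance, so the first entry seen for each h1
# is its closest partner; setdefault ignores all later (larger) entries.
best = {}
for v, h1, h2 in res:
    best.setdefault(h1, (v, h2))

for h1, (v, h2) in best.items():
    print(v, h1, h2)
```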

Answer 2 (score: 0)

I would use functools.reduce

from functools import reduce


def hamming_distance(s1, s2):
    #Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

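if __name__ == '__main__':
    # File names follow the question (ham1.txt, ham2.txt)
    with open('ham1.txt') as f:
        f1 = f.read().splitlines()

    with open('ham2.txt') as f:
        f2 = f.read().splitlines()

    for line in f1:
        # Keep x only while y is strictly farther from line; otherwise take y
        print(line, reduce(lambda x, y: x if hamming_distance(line, y) > hamming_distance(line, x) else y, f2))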

Output:

AAACCCTTTGGG AAACCCTTTGGG
AGGTACTTTTTT AGGTACTTTTTT
TCTCTTTTTTTT TCTCTTTTTTTT
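
For comparison, the same per-line selection can probably be written with the built-in min and a key function instead of reduce. This is only a sketch: the helper name closest is hypothetical, and in-memory lists replace the two files so the block runs on its own. One behavioral difference: on ties, min returns the first minimal candidate, whereas the reduce lambda above ends up with the last one.

```python
def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


def closest(line, candidates):
    # Hypothetical helper: the candidate with the smallest distance to line
    return min(candidates, key=lambda cand: hamming_distance(line, cand))


# Stand-ins for the contents of the two files (assumption for illustration)
f1 = ['AAACCCTTTGGG', 'AGGTACTTTTTT', 'TCTCTTTTTTTT']
f2 = ['AAACCCTTTGGG', 'GAGAGGGAGGGC', 'AGGTACTTTTTT',
      'CTCTTAATTTCC', 'TCTCTTTTTTTT', 'GTTTTTAAAAAA']

for line in f1:
    print(line, closest(line, f2))
```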