寻找重叠和非重叠区域

时间:2014-04-02 09:48:07

标签: python bioinformatics

我有一个包含数百万次点击的BLAST表格输出.Con是我的序列,P是蛋白质命中。我有兴趣区分对应于下面说明的3种情况的命中。它们应该全部打印在3个单独的新文件中,文件1中的重叠群不应该在文件2,3等中。如何做到这一点?

con1 ----------------------- (Contigs with both overlapping and non overlapping hits)
   p1---- p2 ------    p4---
               p3-----
con2 --------------------- (only overlapping)  con3 ----------------(only non overlp)
   p1 -----                                       p1 ----  p2 -----
      p2 -------
          p3 ----- 

如果我知道蛋白质起始和终止位点,则可以识别重叠或非重叠;如果S1 < E2&lt; S2和E1&lt; S2&lt; E2 OR S2-E1&gt; 0。 我的输入文件,即

contig  protein start end
con1    P1  481 931
con1    P2  140 602
con1    P3  232 548
con2    P4  335 406
con2    P5  642 732
con2    P6  2282    2433
con2    P7  729 812
con3    P8  17  148
con3    P9  289 45

我的脚本(这只会打印出不重叠的点击)

from itertools import groupby

def nonoverlapping(hits):
    """Returns a list of non-overlapping hits."""
    nonover = []
    overst =  False
    for i in range(1,len(hits)):
        (p, c) = hits[i-1], hits[i]
        if c[2] > p[3]:
            if not overst: nonover.append(p)
            nonover.append(c)
            overst = True  
    return nonover

fh = open('file.txt')
oh = open('result.txt', 'w')
for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    if len(hits) > 1: 
        hits.sort(key=lambda x: x[2])
        for hit in nonoverlapping(hits):
            oh.write('\t'.join([str(f) for f in hit])+'\n')

1 个答案:

答案 0 :(得分:2)

我会做这样的事情。为两个命中定义“重叠”函数,然后测试每个重叠群是否全部,一些或不重叠。然后将所有重叠群写入所需的文件:

from itertools import groupby

def overlaps(a, b):
    result = True
    # Supposing a[2] is the start, a[3] the end. 
    # If end before start, they are not overlapping
    if a[3] < b[2] or b[3] < a[2]:
        result = False

    return result

def test_overlapping(hits):
    overlapping = 'None'
    overlapping_count = 0
    for i in range(len(hits)-1):
        if overlaps(hits[i], hits[i+1]):
            overlapping_count += 1


    if overlapping_count == 0:
        overlapping = 'None'
    elif overlapping_count == len(hits) -1:
        overlapping = 'All'
    else:
        overlapping = 'Some'

    return overlapping


fh = open('file.txt')

file_all = open('result_all.txt', 'w')
file_some = open('result_some.txt', 'w')
file_none = open('result_none.txt', 'w')

line = fh.readline() # quit header

for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    if len(hits) > 1: 
        hits.sort(key=lambda x: x[2])
        overlapping = test_overlapping(hits)
        out_file = file_none
        if overlapping == 'All':
            out_file = file_all
        elif overlapping == 'Some':
            out_file = file_some

        for h in hits:
            out_file.write('\t'.join([str(v) for v in h]))
            out_file.write('\n')


file_all.close()
file_some.close()
file_none.close()