Python - 重叠范围 - 确定独特的位置

时间:2014-02-21 14:58:43

标签: python list range bigdata overlapping

我有一个大数据集,其中每个值分为3个部分[chromosome, start, end]。计算每条染色体所有独特位置的最快方法是什么,因为我有很多重叠范围。

所以例如:
[['chr1:10:60', 'chr1:5:70', 'chr3:50:80', 'chr1:54:90', 'chr1:120:180', 'chr3:50:90']]

应该导致:
['chr1:5:90', 'chr1:120:180', 'chr3:50:90']

我不知道是否有一种简单的方法来计算这个?但我觉得值得尝试在这里问一下。这里是我数据的一个子集。

提前致谢,

[['chr9:95149330:95149362', 'chr9:95149330:95149362', 'chr17:70386266:70386304', 'chr17:70386256:70386304', 'chr2:44672786:44672833', 'chr2:44672785:44672833', 'chr2:141966446:141966479', 'chr2:141966446:141966488', 'chr19:18126909:18126938', 'chr19:18126909:18127027', 'chr3:145082003:145082051', 'chr3:145082014:145082121', 'chr6:38835529:38835560', 'chr6:38835529:38835560', 'chr4:120372932:120372986', 'chr4:120372932:120372994', 'chr2:141014019:141014057', 'chr2:141014014:141014057', 'chr18:3445722:3445761', 'chr18:3445722:3445793', 'chr17:72329982:72330015', 'chr17:72329982:72330015', 'chr5:169911920:169911962', 'chr5:169911917:169911962', 'chr4:146482176:146482219', 'chr4:146482176:146482219', 'chr9:104285900:104285935', 'chr9:104285879:104285935', 'chr12:32941976:32942016', 'chr12:32941976:32942028', 'chrX:127923156:127923189', 'chrX:127923156:127923189', 'chr2:9535703:9535755', 'chr2:9535701:9535755', 'chr8:86476618:86476684', 'chr8:86476554:86476642', 'chr9:135756650:135756696', 'chr9:135756650:135756706', 'chr6:103004873:103004932', 'chr6:103004861:103004918', 'chr8:86476618:86476684', 'chr8:86476556:86476648', 'chr1:52280846:52280876', 'chr1:52280845:52280876', 'chr8:86476635:86476685', 'chr8:86476553:86476645', 'chr5:116046573:116046620', 'chr5:116046564:116046615', 'chrX:68039214:68039252', 'chrX:68039214:68039252', 'chr4:181491919:181491953', 'chr4:181491919:181491960', 'chr18:68050122:68050166', 'chr18:68050122:68050166', 'chr2:233985816:233985860', 'chr2:233985808:233985860', 'chr6:17020712:17020750', 'chr6:17020712:17020759', 'chr7:21950625:21950666', 'chr7:21950625:21950666', 'chr12:93292486:93292536', 'chr12:93292481:93292537', 'chr1:246515439:246515472', 'chr1:246515440:246515486', 'chr12:57084093:57084130', 'chr12:57084093:57084134', 'chr1:174801431:174801474', 'chr1:174801431:174801485', 'chr7:92499684:92499734', 'chr7:92499924:92499960', 'chr17:40328527:40328560', 'chr17:40328518:40328560', 'chr8:42944072:42944110', 'chr8:42944073:42944120', 'chr17:29890450:29890499']

2 个答案:

答案 0 :(得分:1)

我会分三步完成:

  1. 拆分每条染色体的范围;
  2. 提取连续范围;和
  3. 根据需要汇总输出("chr:start:end")。
  4. 第一步:

    from collections import defaultdict
    
    processed = defaultdict(list)
    
    for s in data:
        chr_, pos = s.split(":", 1)
        processed[chr_].append(list(map(int, pos.split(":"))))
    

    对于

    data == ['chr1:10:60', 'chr1:5:70', 'chr3:50:80', 
             'chr1:54:90', 'chr1:120:180', 'chr3:50:90']
    

    这给出了

    processed == defaultdict(<class 'list'>, 
                             {'chr3': [[50, 80], [50, 90]], 
                              'chr1': [[10, 60], [5, 70], [54, 90], [120, 180]]})
    

    我们现在可以根据重叠将这些组合在一起

    for vals in processed.values():
        finished = False
        while not finished:
            finished = True
            for i, (s1, e1) in enumerate(vals):
                for s2, e2 in vals[i+1:]:
                    if ((s2 <= s1 and e2 >= s1) or
                        (s2 <= e1 and e2 >= e1)):
                       vals[i][0] = min(s1, s2)
                       vals[i][1] = max(e1, e2)
                       vals.remove([s2, e2])
                       finished = False
    

    让我们来:

    processed == defaultdict(<class 'list'>, 
                             {'chr3': [[50, 90]], 
                              'chr1': [[10, 90], [120, 180]]})
    

    现在你可以把它重新组合在一起:

    final = []
    for key, vals in processed.items():
        for start, end in vals:
            final.append(":".join(map(str, (key, start, end))))
    

    离开:

    final == ['chr3:50:90', 'chr1:10:90', 'chr1:120:180']
    

答案 1 :(得分:1)

我同意jonrsharpe关于一般方法的观点,但我认为有一种更优雅的方法。

首先,我们将获得每条染色体的范围(与jonrsharpe几乎相同,尽管我喜欢元组比范围列表更好)。

from collections import defaultdict

processed = defaultdict(list)

for s in data:
    chr_, start, end = s.split(":")
    processed[chr_].append((int(start), int(end)))

现在,我们可以通过在范围的开头对每个染色体的列表进行排序来使合并变得更加简单。这为我们提供了一个很好的属性,如果之前的范围都没有与当前范围重叠,那么我们知道我们对之前的值进行的任何合并都是最终的,我们不必再回到它。

for vals in processed.values():
    vals.sort()
    current = 1
    while current < len(vals):
      if vals[current-1][1] > vals[current][0]:
        # current and previous ranges overlap, so merge previous and current values.
        vals[current-1:current+1] = [(vals[current-1][0], vals[current][1])]
        # Because we reduced the number of values in the list by 1,
        # current now points at the next interesting value.
      else:
        current += 1 # We didn't merge, so we must increment current

现在我们可以像jonrsharpe那样把它重新组合在一起:

final = []
for key, vals in processed.items():
    for start, end in vals:
        final.append("%s:%s:%s" % (key, str(start), str(end)))

这也提供了final == ['chr3:50:90', 'chr1:5:90', 'chr1:120:180']