Question

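From any *.fasta DNA sequence (only 'ACTG' characters), I have to find all sequences in which every letter appears at least once.

For the test sequence 'AAGTCCTAG', I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating over each starting letter).

I don't know how to do this in Python 2.7. I'm trying to use regular expressions, but they don't find every variant.

How can I achieve this?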

Answer 1

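You can find all substrings of length 4 or more, then narrow them down to only the shortest possible combinations that contain at least one of each letter: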

s = 'AAGTCCTAG'

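def get_shortest(s):
  l, b = len(s), set('ATCG')
  # All substrings s[i:j+1] of length 4 or more
  options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
  # Keep those that contain all four letters and lose one when the last character is dropped
  return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]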

print(get_shortest(s))

Output:

['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
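Since the question asks about regular expressions, here is a rough sketch of how the same per-start shortest match could be expressed as a regex. It is not part of the answer above: the names build_pattern and find_with_regex are made up for illustration, and it assumes the alphabet consists of plain letters. The idea is to build one alternative per order in which the four letters can first appear, and to wrap the whole thing in a lookahead so that overlapping matches are reported.

```python
import re
from itertools import permutations

def build_pattern(alphabet='ACGT'):
    # One alternative per order in which the letters can first appear.
    alternatives = []
    for perm in permutations(alphabet):
        parts = []
        seen = ''
        for letter in perm:
            if seen:
                # Any run of already-seen letters, then the next new letter.
                parts.append('[%s]*' % seen)
            parts.append(letter)
            seen += letter
        alternatives.append(''.join(parts))
    # The lookahead consumes nothing, so a match is attempted at every position.
    return re.compile('(?=(%s))' % '|'.join(alternatives))

def find_with_regex(seq, alphabet='ACGT'):
    # group(1) holds the substring captured inside the lookahead.
    return [m.group(1) for m in build_pattern(alphabet).finditer(seq)]

print(find_with_regex('AAGTCCTAG'))
# ['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
```

A plain pattern without the lookahead would consume each match and skip the overlapping ones, which is probably why a straightforward regex does not find every variant.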

Answer 2

Here is another way you could do it. It may not be as fast or as elegant as chrisz's answer, but for beginners it may be a bit easier to read and understand.

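DNA = 'AAGTCCTAG'
toSave = []
for i in range(len(DNA)):
    # Letters still missing from the substring that starts at index i
    letters = ['A', 'G', 'T', 'C']
    j = i
    seq = []
    # Extend the substring until every letter has been seen or the input ends
    while len(letters) > 0 and j < len(DNA):
        seq.append(DNA[j])
        if DNA[j] in letters:
            letters.remove(DNA[j])
        j += 1
    if len(letters) == 0:
        # Store the substring as a string rather than a list of characters
        toSave.append(''.join(seq))

print(toSave)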

Answer 3

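Since the substrings you are looking for can be of just about any length, a FIFO queue seems to work. Append each letter in turn and check whether the queue holds at least one of each letter. If it does, yield the current contents. Then remove the letter at the front and keep checking until the queue is no longer valid.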

def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count,chars)):
            yield("".join(cur_str))
            cur_str.pop(0)

seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)

This seems to produce the substrings you are looking for:

AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
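For long sequences, a possible variant of the same queue idea keeps a running count per letter instead of calling count on the whole window each time. This is only a sketch, not the answer's code: find_windows and its variable names are made up for illustration, and it uses collections.deque so that removing the front letter does not shift the rest of the window.

```python
from collections import Counter, deque

def find_windows(seq, alphabet='ACGT'):
    needed = set(alphabet)
    window = deque()
    counts = Counter()
    missing = len(needed)  # letters of the alphabet not yet in the window
    for ch in seq:
        window.append(ch)
        if ch in needed:
            counts[ch] += 1
            if counts[ch] == 1:
                missing -= 1
        # While the window holds every letter, report it and drop the front.
        while missing == 0:
            yield ''.join(window)
            first = window.popleft()
            if first in needed:
                counts[first] -= 1
                if counts[first] == 0:
                    missing += 1

for substr in find_windows('AAGTCCTAG'):
    print(substr)
```

On the test sequence this yields the same six substrings in the same order as the generator above.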

Answer 4

I really wanted to come up with a short answer for this, so this is what I ended up with!

s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)

while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        print(x)
        s,c = s[1:],len(d)

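It works as follows:

c is set to the length of the string of characters we want to make sure are present (d = ACGT).
The while loop iterates over each possible substring of s, for as long as c is no greater than the length of s.
- This works by increasing c by 1 on each iteration of the while loop.
- If every character in d (ACGT) is present in the substring, we print the result, reset c to its default value, and slice one character off the start of the string.
- The loop continues until the string s is shorter than d.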

Result:

AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG

To get the output in a list:

s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]

while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        r.append(x)
        s,c = s[1:],len(d)

print(r)

Result:

['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']

Answer 5

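If you can break your sequence into a list of strings (for example, test_list in the code below), you can use this function to find the items in which some character is repeated at least n_repeats times in a row.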

from itertools import groupby
import numpy as np

def find_repeats(input_list, n_repeats):
    flagged_items = []

    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))

        # Create list of tuples: (character, length of its consecutive run)
        result = [(label, sum(1 for _ in group)) for label, group in groups]

        # Extract just number of repeats
        char_lens = np.array([x[1] for x in result])   

        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)

    # Return flagged items
    return flagged_items

#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']

find_repeats(test_list, n_repeats=2)  # Returns ['aatcg', 'ctagg']
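If NumPy is not available, the same check can probably be written with itertools.groupby alone. This is a sketch under that assumption, not the answer's code; has_run and find_repeats_plain are names made up for illustration.

```python
from itertools import groupby

def has_run(item, n_repeats):
    # True if any character occurs at least n_repeats times in a row.
    return any(sum(1 for _ in group) >= n_repeats
               for _, group in groupby(str(item)))

def find_repeats_plain(input_list, n_repeats):
    # Keep the items that contain a long enough run of one character.
    return [item for item in input_list if has_run(item, n_repeats)]

print(find_repeats_plain(['aatcg', 'ctagg', 'catcg'], n_repeats=2))
# ['aatcg', 'ctagg']
```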

Find, with a regex, sequences in which each letter appears at least once
