Searching a string for a substring made of characters from a list

Time: 2015-07-08 15:26:17

Tags: python list file

  

sp|P46531|NOTC1_HUMAN Neurogenic locus notch homolog protein 1 OS=Homo sapiens GN=NOTCH1 PE=1 SV=4   MPPLLAPLLCLALLP

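I have a FASTA file, and I want to search the file for the start of the amino acid sequence. What I have so far is something like this: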

aminoacids = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
for filename in file_list:
    with open(filename, 'r') as fh:
        while True:
            char = fh.read(1)
            if not char:
                break  # end of file reached
            if char.upper() in aminoacids:
                # look for the 4 characters directly after it
                pass

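What I want is this: if a character is found in the amino acid list and the four characters after it are also in the list, a string should be built starting at that character and continuing until there are no more characters. For example, I want to walk through the file character by character. If I find M, I then want to check the next four characters (PPLL). If those four characters are amino acids, I want to create a string that starts with M and continues to the end of the file.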

1 answer:

Answer 0 (score: 2)

You can read in the file as a single string, and then search for a regular expression:

import re

# aminoacids is the list defined in the question
regex = re.compile("[%s]{5}.*" % "".join(aminoacids))

with open(filename, 'r') as fh:
    s = fh.read()
    aa_sequence = regex.findall(s)
    if len(aa_sequence) > 0:
        # an amino acid sequence was found
        print(aa_sequence[0])

This works because the regular expression that is constructed is:

[ACDEFGHIKLMNPQRSTVWY]{5}.*

which means "five of these characters, followed by anything up to the end of the line" (by default, "." does not match a newline).
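As a quick check of the pattern, here is a minimal sketch that applies it to the header line from the question. The helper name find_aa_start and the use of search() in place of findall() are illustrative choices, not part of the answer above; the sample line is the question's header written in plain FASTA style.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Five consecutive amino-acid letters, then the rest of the line.
AA_REGEX = re.compile("[%s]{5}.*" % AMINO_ACIDS)


def find_aa_start(text):
    """Return the text from the first run of five amino-acid letters, or None.

    Hypothetical helper, used here only for illustration.
    """
    match = AA_REGEX.search(text)
    return match.group(0) if match else None


line = ("sp|P46531|NOTC1_HUMAN Neurogenic locus notch homolog protein 1 "
        "OS=Homo sapiens GN=NOTCH1 PE=1 SV=4 MPPLLAPLLCLALLP")
print(find_aa_start(line))
```

The header itself never contains five amino-acid letters in a row (O and U are not in the list, and the pattern is case-sensitive), so the first match starts at the M of the sequence.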

Note that if your amino acid string may span multiple lines, you'll need to remove the newlines first, with:

s = fh.read().replace('\n', '')
# or
s = "".join(line.rstrip('\n') for line in fh)
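If you would rather not strip the newlines before searching, one possible alternative is the re.DOTALL flag, which lets "." match newlines too; the line breaks can then be removed from the match afterwards. This is only a sketch under assumptions: extract_sequence is a hypothetical helper name, the sample text is a shortened version of the question's record, and the first five amino-acid letters are assumed to sit on a single line.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# With re.DOTALL, "." also matches newlines, so the match runs to the end of the text.
AA_REGEX = re.compile("[%s]{5}.*" % AMINO_ACIDS, re.DOTALL)


def extract_sequence(text):
    """Return the sequence from its first five-letter run onward, without whitespace.

    Hypothetical helper, used here only for illustration.
    """
    match = AA_REGEX.search(text)
    if match is None:
        return None
    # split() with no arguments drops newlines and any other whitespace in the match.
    return "".join(match.group(0).split())


sample = "sp|P46531|NOTC1_HUMAN OS=Homo sapiens\nMPPLLAPLLC\nLALLP\n"
print(extract_sequence(sample))
```

Unlike removing the newlines up front, this keeps the header and the sequence on separate lines while searching, so a run of letters at the end of the header cannot merge with the start of the sequence.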