Finding and extracting FASTA records that contain an exact DNA sequence with Biopython

Time: 2017-10-20 21:51:09

Tags: python biopython fasta

I am trying to use Biopython to extract, from a FASTA file, every DNA sequence that contains the following short DNA sequence: "GGCTCAACCCTGGA"

This is what I have so far:

from Bio import SeqIO

source = "rep_set_no_spaces.fasta"
outfile = "rep_set_PNA_matches.fasta"
seq1 = "GGCTCAACCCTGGA"

# basically a function to check whether seq contains seq1
def seq_check(seq, seq1):
    return seq.find(seq1)

seqs = SeqIO.parse(source, 'fasta')
filtered = (seq for seq in seqs if seq_check(seq.seq, seq1))
SeqIO.write(filtered, outfile, 'fasta')

I am trying to adapt the code from this post (Filtering a FASTA file based on sequence with BioPython), but the sequence I am interested in is neither at the start nor at the end of the records.

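For example, here are some of my sequences. The 1st and 4th match, but the 2nd and 3rd do not. I want to pull the matching records out into a new FASTA file that contains only the sequences including "GGCTCAACCCTGGA":
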
>110148arco.1D_184193
TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGTGGATTGGAAAGTATGGGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTCATAAACTATCAGTCTAGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACACTGAGGTGCGAAAGTGTGGGGAGCAAACAGG
>110475arco.1D_40770
TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGTTAAGTCAGCTGTGAAAGCCCTGGGCTCAACCTGGGAATTGCAGTTGATACTGGCAAGCTGGAGTACGAGAGAGGGAGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAATACCAGTGGCGAAGGCGGCCTCCTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
>110484arco.1D_190999
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGTTAAGTCAGCTGTGAAAGCCCTGGGCTCAACCTGGGAATTGCAGTTGATACTGATCGACTAGAGTACGAGAGAGGGAGGTAGAATTCCACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAATACCGGTGGCGAAGGCGGCCTCCTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
>110525amin.3D_40107
TACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTAGTAAGTAAGATGTGAAATCCCAGGGCTCAACCCTGGAACTGCATTTTAAACTGCTAGTCTAGAGTTATGGAGAGGTAAGTGGAATTCCTAGTGTAGAGGTGAAATTCGTAGATATTAGGAGGAACACCAGAGGCGAAGGCGACTTACTGGACATATACTGACGCTGAGGTACGAAAGTGTGGGTAGCAAACAGG

Thanks!

1 answer:

Answer 0 (score: 1)

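Actually, this question is not about Biopython but about Python. `find` returns the index of the match, or -1 when there is no match, and -1 is truthy, so the original check lets non-matching records through. Use the `in` operator instead: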

def seq_check(seq, seq1):
    # True when seq1 occurs anywhere in seq, False otherwise
    return seq1 in seq

You can also put the test directly into the generator expression:

filtered = (seq for seq in seqs if seq1 in seq)
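
To see why the `find`-based check behaves differently from `in`, here is a minimal sketch using plain Python strings, so it runs without Biopython or the FASTA file. The names `buggy_check`, `fixed_check`, `hit`, `miss` and `at_start` are made up for this illustration, and the test strings are short excerpts rather than full records. As far as I know, `Seq.find` follows the same return convention as `str.find`, but that is an assumption here and not something this snippet verifies.

```python
target = "GGCTCAACCCTGGA"

# Short illustrative strings (excerpts, not complete FASTA records)
hit = "CCCAGGGCTCAACCCTGGAACTG"    # contains the target in the middle
miss = "CCCTGGGCTCAACCTGGGAATTG"   # similar, but does not contain the target
at_start = target + "ACTG"         # target sits at index 0


def buggy_check(seq, sub):
    # find() returns an index, or -1 when sub is absent.
    # bool(-1) is True and bool(0) is False, so this is wrong twice over.
    return bool(seq.find(sub))


def fixed_check(seq, sub):
    # Membership test: True only when sub occurs somewhere in seq
    return sub in seq


for name, seq in [("hit", hit), ("miss", miss), ("at_start", at_start)]:
    print(name, seq.find(target), buggy_check(seq, target), fixed_check(seq, target))
```

If you prefer to keep `find`, comparing its result explicitly (`seq.find(seq1) != -1`) gives the same answer as `seq1 in seq`.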