Question

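I compared two FASTA files (with sequences of different lengths and different names) and put the shared sequence names into a list. I am now trying to fetch the sequences using the names in that list.

File 1: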

SRR3350720.1

atccaaccactaaagcagtggtatcaacgcagagtacatggggacattcagtgattatggcatgcactgggtc

SRR3350720.3

caggtgcaggtggtgcagtctggggctgaggtgaagaagcctggctcctcggtgaagatctcatgcaaggctt

SRR3350720.5

caggtccagctggtacagtctggggctgaggtgaagaagcctggggcctcagtgaaggtctcctgcaaggttg

SRR3350720.6

caggtgcagttccagccgtggggcgcaggactgttgaagccttcggagaccctgtccctcacctgcgctgtct

list = ['SRR3350720.1', 'SRR3350720.5']

I tried this script in Python.

import HTSeq
fasta_file = HTSeq.FastaReader('file1.fasta', 'r')
for line in fasta_file:
    for ls in list:
        if str(line.name) == ls:
            print str(line)

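However, for each NGS sequencing run I have millions of sequences and 100,000 sequence IDs in the list. How can I improve the script so that it processes the data efficiently?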

Answer 1

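Your process is mainly I/O bound, i.e. parallelizing your code across multiple CPUs will not help. The best way to speed it up is to copy the FASTA file to an SSD or load it directly into memory.

Regarding your code, let's assume your sequence IDs are stored in a list called seq_ids.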

import HTSeq
seq_ids = ['SRR3350720.1','SRR3350720.5']
fasta_file = HTSeq.FastaReader('file1.fasta', 'r')
for read in fasta_file:
    # a single membership test replaces the inner loop over the ID list
    if str(read.name) in seq_ids:
        print(str(read))

Explanation:

str(read.name) in seq_ids

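checks whether read.name is in your list, and only then prints the read itself.

In your code, you loop over the whole search list for every read, and even after a read has matched you keep going through the rest of the loop.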
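With 100,000 IDs, the `in` test on a list still scans the list for every read. A minimal sketch of the same check using a set, which gives constant-time lookups on average. The helper name `select_reads` and the plain `(name, sequence)` tuples are assumptions for illustration, not part of HTSeq:

```python
def select_reads(reads, seq_ids):
    """Yield the (name, sequence) pairs whose name is in seq_ids."""
    wanted = set(seq_ids)  # built once; membership tests are then O(1) on average
    for name, sequence in reads:
        if name in wanted:
            yield name, sequence


# Toy data standing in for the records of a FASTA reader (hypothetical values).
reads = [
    ("SRR3350720.1", "atcc"),
    ("SRR3350720.3", "cagg"),
    ("SRR3350720.5", "cagt"),
]
selected = list(select_reads(reads, ["SRR3350720.1", "SRR3350720.5"]))
for name, sequence in selected:
    print(name, sequence)
```

In the HTSeq loop above, the equivalent change would be to build `wanted = set(seq_ids)` once before the loop and test `read.name in wanted`.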

If you only need the header and a single line of sequence, try grep:

grep -A1 -w -f list.txt file1.fasta

-A1 prints one line after each matching line

-w matches whole words only

-f reads the patterns from list.txt

file1.fasta is the file to search
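The grep approach assumes each sequence sits on a single line. If your sequences are wrapped over several lines, something like the following pure-Python sketch may work instead. It assumes headers start with ">" and that the ID is the first whitespace-delimited token of the header; `iter_fasta` and `extract` are hypothetical helper names, and the demo uses in-memory text rather than real files:

```python
import io


def iter_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA handle, one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # the ID is taken to be the first token after '>' (an assumption)
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract(fasta_handle, id_handle, out_handle):
    """Write the records whose ID is listed in id_handle; return how many were written."""
    wanted = {line.strip() for line in id_handle if line.strip()}
    found = 0
    for name, sequence in iter_fasta(fasta_handle):
        if name in wanted:
            out_handle.write(">%s\n%s\n" % (name, sequence))
            found += 1
    return found


# Small in-memory demo; the first record is wrapped over two lines.
demo_fasta = io.StringIO(
    ">SRR3350720.1\natcc\naacc\n>SRR3350720.3\ncagg\n>SRR3350720.5\ncagt\n"
)
demo_ids = io.StringIO("SRR3350720.1\nSRR3350720.5\n")
result = io.StringIO()
count = extract(demo_fasta, demo_ids, result)
print(result.getvalue(), end="")
```

For real data you would pass open file objects, e.g. `open('file1.fasta')`, `open('list.txt')` and `sys.stdout`. The file is streamed record by record, so only the ID set has to fit in memory.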
