Question

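I have a sequence file and a barcode file. The barcode file can contain barcodes of any length, for example "ATTG, AGCT, ACGT". The sequence file looks like "ATTGCCCCCCCGGGGG, ATTGTTTTTTTT, AGCTAAAAA", for example. I need to match the barcodes to the sequences that begin with them. Then, for each group of sequences that share a barcode, I have to run calculations on them using the rest of the program (which is already written). I just don't know how to get them to match. The part I have been poking at with print statements is the "potential_barcode = line[:len(barcode)]" line. Also, where it says #simp to fasta is where I should be reading in the matched sequences. I'm very new to this, so I have probably made a lot of mistakes. Thanks for any help!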

bcodefname = sys.argv[1]
infname = sys.argv[2]
barcodefile = open(bcodefname, "r")
for barcode in barcodefile:
        barcode = barcode.strip()
        print "barcode: %s" % barcode
        outfname = "%s.%s" % (bcodefname,barcode)
#           print outfname
        outf = open("outfname", "w")
        handle = open(infname, "r")
        for line in handle:
                potential_barcode = line[:len(barcode)]
                print potential_barcode
                if potential_barcode == barcode:
                        outseq = line[len(barcode):]
                        sys.stdout.write(outseq)
                        outf.write(outseq)
                        fastafname = infname + ".fasta"
                        print fastafname
                        mafftfname = fastafname + ".mafft"
                        stfname = mafftfname + ".stock"
                        print stfname
#simp to fasta#
#                       handle = open(infname, "r")
                        outf2 = open(fastafname, "w")
                        for line in handle:
                                linearr = line.split()
                                seqid = linearr[0]
                                seq = linearr[1]
                                outf2.write(">%s\n%s\n" % (seqid,seq))
#                       handle.close()
#                       outf.close()
#mafft#
                        cmd = "mafft %s > %s" % (fastafname,mafftfname)
                        sys.stderr.write("command: %s\n" % cmd)
                        os.system(cmd)
                        sys.stderr.write("command done\n")

Answer 1

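I don't know why this code isn't working for you, but here are some tips for improving it:

You are reading the entire sequence file once for every barcode. If there are 100 barcodes, you read the sequence file 100 times. What you should do instead is read the barcode file once and build a list of barcodes.

First, define a function that we will use to check for a match: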

def matches_barcode(sequence, barcodes):
  for bc in barcodes:
    if sequence.startswith(bc):
      return True
  return False

(Note that I use startswith rather than building a new string and comparing it; startswith should be faster.)

Now read the barcodes from the file:
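
As a side note, str.startswith also accepts a tuple of prefixes, so the loop can be collapsed into a single call. A minimal sketch of that variant, using the sample barcodes from the question as made-up test data:

```python
def matches_barcode(sequence, barcodes):
    # str.startswith accepts a tuple of prefixes and returns True
    # if the string starts with any one of them.
    return sequence.startswith(tuple(barcodes))


barcodes = ["ATTG", "AGCT", "ACGT"]  # sample data, not read from a file
print(matches_barcode("ATTGCCCCCCCGGGGG", barcodes))
print(matches_barcode("GGGGATTG", barcodes))
```

The first call matches because the sequence begins with ATTG; the second does not, because the barcode appears only at the end.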

barcodes = []
with open(bcodefname, "r") as barcodefile:
  for barcode in barcodefile:
    barcodes.append(barcode.strip())

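(Note that I used with open...; your code leaks open files, which could stop the program from running if you have a large number of barcodes.)

Then read the sequence file once, checking each sequence for a barcode match: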

with open(infname, "r") as seqfile:
  for sequence in seqfile:
    if matches_barcode(sequence.strip(), barcodes):
      # Got a match, continue processing.
      pass  # placeholder so the block is valid Python

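This will be much faster because it does far less I/O: it reads 2 files instead of N + 1 files, where N is the number of barcodes. But it is still a very naive algorithm: if it is too slow, you will have to look into more sophisticated algorithms for checking matches.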
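
Since the question asks for the sequences to be grouped per barcode, here is one possible sketch that returns which barcode matched and collects the remainders in a dictionary. It also swaps the linear scan for a set lookup on each distinct barcode length, which is one simple way to speed things up. The helper names find_barcode and group_by_barcode are hypothetical, the sample data comes from the question, and trying the longest barcode first is an assumption about how overlapping barcodes should be resolved:

```python
from collections import defaultdict


def find_barcode(sequence, barcode_set, lengths):
    # Check the prefix of each distinct barcode length (longest first)
    # against the set; a set lookup avoids looping over every barcode.
    for n in lengths:
        prefix = sequence[:n]
        if prefix in barcode_set:
            return prefix
    return None


def group_by_barcode(sequences, barcodes):
    # Ignore empty barcodes (for example, blank lines in the barcode file).
    barcode_set = {bc for bc in barcodes if bc}
    lengths = sorted({len(bc) for bc in barcode_set}, reverse=True)
    groups = defaultdict(list)
    for seq in sequences:
        seq = seq.strip()
        bc = find_barcode(seq, barcode_set, lengths)
        if bc is not None:
            # Store the sequence with its barcode removed, as the question's code does.
            groups[bc].append(seq[len(bc):])
    return groups


groups = group_by_barcode(
    ["ATTGCCCCCCCGGGGG\n", "ATTGTTTTTTTT\n", "AGCTAAAAA\n"],
    ["ATTG", "AGCT", "ACGT"],
)
for bc in sorted(groups):
    print(bc, groups[bc])
```

Each value in groups is the list of sequences for one barcode, so the existing per-group calculations could run once per dictionary entry after the file has been read.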

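If you still aren't getting the matches you expect, you need to debug: print out the exact strings being compared so you can see what is going on. In these cases it is a good idea to use repr to really see the data, whitespace and all.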
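
A small illustration of why repr helps, using a made-up line as it would come out of a file. A plain print hides the trailing newline, while repr shows it:

```python
barcode = "ATTG"
line = "ATTG\n"  # lines read from a file keep their trailing newline

print(repr(line))          # 'ATTG\n'
print(repr(line.strip()))  # 'ATTG'

# The unstripped line is not equal to the barcode, the stripped one is.
print(line == barcode)
print(line.strip() == barcode)
```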

Matching barcodes to sequences in Python?
