Question

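After far too long a break I have come back to Python, and I am struggling with what should be a simple task: comparing each number in file A against all the numbers in file B, looping over file A one line at a time. The numbers in file A are in column 2 (tab-separated). I want to return the ones that are greater than exonStart (column 4 of file B) and less than exonStop (column 5 of file B). Finally, I want to write those lines (the full line from file A appended to the line from file B that it matches) to a new file.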

fileA (trimmed for relevant info and truncated):
    1       10678   12641
    1       14810   14929 
    1       14870   14969  

fileB (trimmed for relevant info and truncated):
    1       processed_transcript    exon    10000   12000  2
    1       processed_transcript    exon    10500   12000  2
    1       processed_transcript    exon    12613   12721  3     
    1       processed_transcript    exon    14821   14899  4

My attempt is below; I can explain it in more detail if needed.

f = open('fileA')
f2 =open('fileB')

for line in f:
    splitLine= line.split("\t")
    ReadStart= int(splitLine[1])
    print ReadStart
    for line2 in f2:
        splitLine2=line2.split("\t")
        ExonStart = int(splitLine2[3])
        ExonStop = int(splitLine2[4])
        if ReadStart < ExonStop and ReadStart > ExonStart:
            print ReadStart, ExonStart, ExonStop
        else:
            print "BOO"   
f.close()

What I expect from my code is the following, where the first column is ReadStart from file A and the last two columns come from file B:

    10678   10000   12000
    10678   10500   12000
    14870   14821   14899

My code only returns the first line.

Answer 1

The problem is here:

splitLine2=line.split("\t")

Since you are working with a line from file 2, it should be:

splitLine2=line2.split("\t")

Answer 2

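The problem is your file pointer. You open file B at the top of your code and then iterate all the way through it while processing the first line of file A. This means that at the end of the first iteration of the outer loop, your file pointer is at the end of file B. On the next iteration there are no more lines to read from file B, because the pointer is already at the end of the file, so the inner loop is skipped.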
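The exhaustion described above is easy to reproduce in isolation. The following is a minimal sketch that uses `io.StringIO` as a stand-in for an open file (an assumption, so that the example runs without any files on disk; the contents are made up). A real file object opened with `open()` behaves the same way when iterated twice.

```python
import io

# Stand-in for an open file; the three lines are invented for illustration.
handle = io.StringIO("a\nb\nc\n")

first_pass = [line.strip() for line in handle]   # consumes every line
second_pass = [line.strip() for line in handle]  # pointer is already at the end

print(first_pass)
print(second_pass)
```

The second list comes back empty, which is exactly what happens to the inner loop over file B on every outer iteration after the first.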

One option is to call seek on file B at the end of each iteration of the outer loop, which resets the file pointer to the top of the file:

f2.seek(0)
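To show where the call belongs, here is a minimal sketch of the original nested loop with the rewind added. The function name `find_overlaps` is hypothetical, and `io.StringIO` objects holding the sample rows from the question stand in for the two open files (assuming the real files are tab-separated as described).

```python
import io

def find_overlaps(reads, exons):
    """Return (read_start, exon_start, exon_stop) for each read inside an exon."""
    matches = []
    for line in reads:
        read_start = int(line.split("\t")[1])
        for line2 in exons:
            fields = line2.split("\t")
            exon_start, exon_stop = int(fields[3]), int(fields[4])
            if exon_start < read_start < exon_stop:
                matches.append((read_start, exon_start, exon_stop))
        exons.seek(0)  # rewind file B before moving to the next line of file A
    return matches

# Sample rows copied from the question; StringIO replaces open('fileA') / open('fileB').
reads = io.StringIO("1\t10678\t12641\n1\t14810\t14929\n1\t14870\t14969\n")
exons = io.StringIO(
    "1\tprocessed_transcript\texon\t10000\t12000\t2\n"
    "1\tprocessed_transcript\texon\t10500\t12000\t2\n"
    "1\tprocessed_transcript\texon\t12613\t12721\t3\n"
    "1\tprocessed_transcript\texon\t14821\t14899\t4\n"
)

matches = find_overlaps(reads, exons)
for match in matches:
    print(*match)
```

On these sample rows the sketch finds the three matches listed as the expected output in the question.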

However, I would advocate changing your approach and reading file B into memory, so that you are not reading the file over and over:

# use context managers to open your files instead of file pointers for
# cleaner exception handling
with open('f2.txt') as f2:

    exon_points = []

    for line in f2:
        split_line = line.split() # notice that the split function will split on
                                  # whitespace by default, so "\t" is not necessary

        # append a tuple of the information we care about to the list
        exon_points.append((int(split_line[3]), int(split_line[4])))

with open('f1.txt') as f1:

    for line in f1:
        read_start = int(line.split()[1])  

        # each entry of exon_points is an (exon_start, exon_stop) tuple
        for exon_start, exon_stop in exon_points:

            if read_start < exon_stop and read_start > exon_start:
                print("{} {} {}".format(read_start, exon_start, exon_stop))

            else:
                print("BOO")

Output:

10678 10000 12000
10678 10500 12000
BOO
BOO
BOO
BOO
BOO
14830 14821 14899
BOO
BOO
BOO
14870 14821 14899
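The question also asks for the matching lines to be written to a new file, which the code above does not do. Below is a minimal sketch of that last step, not a definitive implementation. The function name `write_matches` is hypothetical, the column order (file B line first, file A line appended after a tab) is an assumption about what the asker wants, and `io.StringIO` objects stand in for the input and output files.

```python
import io

def write_matches(reads, exons, out):
    """Write each exon line with the matching read line appended; return the count."""
    # Load file B once, keeping the bounds together with the original line.
    exon_rows = []
    for line in exons:
        fields = line.split()
        exon_rows.append((int(fields[3]), int(fields[4]), line.rstrip("\n")))

    count = 0
    for line in reads:
        read_line = line.rstrip("\n")
        read_start = int(read_line.split()[1])
        for exon_start, exon_stop, exon_line in exon_rows:
            if exon_start < read_start < exon_stop:
                # Assumed layout: file B line, a tab, then the file A line.
                out.write(exon_line + "\t" + read_line + "\n")
                count += 1
    return count

# Sample rows copied from the question.
reads = io.StringIO("1\t10678\t12641\n1\t14810\t14929\n1\t14870\t14969\n")
exons = io.StringIO(
    "1\tprocessed_transcript\texon\t10000\t12000\t2\n"
    "1\tprocessed_transcript\texon\t10500\t12000\t2\n"
    "1\tprocessed_transcript\texon\t12613\t12721\t3\n"
    "1\tprocessed_transcript\texon\t14821\t14899\t4\n"
)
out = io.StringIO()

count = write_matches(reads, exons, out)
print(count)
print(out.getvalue(), end="")
```

With real files, the three `StringIO` objects would be replaced by handles from `with open(...)` blocks, the output one opened in `'w'` mode.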

Python: comparing numbers in two files
