Print the matching line and the line after the match

Time: 2017-08-24 11:39:41

Tags: python python-2.7 bioinformatics fasta text-manipulation

My file contains 82 pairs of IDs:

EmuJ_000063620.1    EgrG_000063620.1    253 253
EmuJ_000065200.1    EgrG_000065200.1    128 128
EmuJ_000081200.1    EgrG_000081200.1    1213    1213
EmuJ_000096200.1    EgrG_000096200.1    295 298
EmuJ_000114700.1    EgrG_000114700.1    153 153
EmuJ_000133800.1    EgrG_000133800.1    153 153
EmuJ_000139900.1    EgrG_000144400.1    2937    2937
EmuJ_000164600.1    EgrG_000164600.1    167 167

I also have two other files containing the sequences for the EmuJ_* IDs and the EgrG_* IDs, like this:

EgrG_sequences.fasta

>EgrG_000632500.1
MKKKSHRKSPEGNHSLTKAANKDTAKCNEERGRNIGQSNEEENATRSEKDREGDEDRNLREYVISIAQKYYPHLVSCMRQDDDNQASADARGADGANDEEHCPKHCPRLNAQKYYLYSATCNHHCEDSQASCDEEGDGKRLLKQCLLWLTERYYPSLAARIRQCNDDQASSNAHGADETDDGDRRLKQALLLFAKKLYPCVTTCIRHCVADHTSHDARGVDEEVDGEQLLKQCLHSSAQKFYPRLAACVCHCDADHASTETCGALGVGNAERCPQQCPCLCAQQYYVQSATCVHHCDNEQSSPETRGVKEDVDVEQLLKQCLLMFAEKFHPTLAAGIRSCADDESSHVASVEGEDDADKQRLKQYLLLFAQKYYPHLIAYIQKRDDDQSSSSVRDSGEEANEEEERLKQCLLLFAQKLYPRLVAYTGRCDSNQSTSDGCSVDGEEAEKHYLKQSLLLLAQKYYPSLAAYLRQFDDNQSSSDVRSVDEEEAEKRHLKQGLLFFAEKYYPSLATYIRRCDDDQSSSDARVVDEVDDEDRRLKQGLLLLAQKYYPPLANYIRHSQSSFNVCGADEKEDEEHCLNQLPRLCAQEAYIRSSSCSHHCDDDQASNDTLVVDKEEEEKYRLKQGLLLLAQKFYPPLATCIHQCDDQSSHDTRGVDEEEAEEQLLKKCLLMFAEKFYPSLAATIHHSVYDQASFDMRDVDTENDETHCLSLSAENYSTASTTCIHHSDGDQSTSDACGVEEGDVEEQRLKRGLLLLAQKYYPSLAAYICQCDDYQPSSDVCGVGEEDTGEERLKQCLLLFAKKFYPSLASRNSQCGDNLILNDEVVGETVINSDTDTDEVTPVEKSTAVCDEVDEVPFKYVGSPTPLSDVDVDSLEKVIPPNDLTAHSSFQNSLDHSVEGGYPDRAFYIGRHTVESADSTAPLSKSSSTKLYFSNTDEFPTEEEVSSPIAPLSIQRRIRIYLEDLENVRKVSLIPLCKTDKFGNPQEEIIIDSNLDDDTDESKLSSVDVEFTMEQADATPLDLEAQDEDLKNCVAIILKHIWSELMECIRREGLSDVYELSLGDRRIEVPQDDVCLVR*
>EgrG_000006700.1
MTDTKGPDESYFEKEAFSSLPQPVDSPSASATDTDRIPVVAVSLPVSSGSIDVNCNCSCYLIICETKLIIDYQMTRKW*

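And so on; EmuJ_sequences.fasta looks the same. I need to fetch the sequences for each pair and write them out one after another, keeping the pair order, like this: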

>EmuJ_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYENQCRLDTEVKRLCSNIQTFNRQVDMWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EgrG_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYDNQCRLDTEVKRLCSNIQTFNCQVDLWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EmuJ_000065200.1
MLCLITPFPSVVPVCVRTCVCMCPCPLLLILYTWSAYLVPFSLPLCLYAHFHIRFLPPFSSLSIPRFLTHSLFLPSYPPLTMLRMKKSLAPCPAERR*
>EgrG_000065200.1
MLCLVTSFPSAVPVCMRTCVCMCSCPLLLILYTWSAYLVPFSLPLCLYTHLHIRFLPPFPSLAIPRFLTHPLFLPTSLYVADKKEPSAMPRRASLRQMLLIVLLQELH*
>EmuJ_000081200.1
MNSLRIFAVVITCLMVVGFSYSIHPTFPSYQSVVWHSSANTGYECRDGICGYRCSNPWCHGFGSILHPQMGVQEMWGSAAHGRHAHSRAMTEFLAKASPEDVTMLIESTPNIDEVITSLDGEAVTILINKLPNLRRVMEELKPQTKMHIVSKLCGKVGSAMEWTEARRNDGSGMWNEYGSGWEGIDAIQDLEAEVIMRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPTPATSVSSMDPLVLRSIILNMPNLNDILMQVDPVYLQSALVHVPGFGAYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQRMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPRAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPTRFANVLLHVKPDYVRYIIEKHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPRVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAATASATEEPTVTVDYANLLRSKIPLIDNVIKMSDPEKVAILRDNLLDVSRILVNLDPTMLRNINSIIFNATKMLNELSVFLVEYPLEYLHKEGKSGVAVNKSEQVGTTGENGVSSIAVEKLQMVLLKIPLFDQFLKWIDQKKLHELLNKIPTLLEVIATANQETLDKINSLLHDAIATMNTAKKLIVTGICRKLAEEGKLRLPRVCPSAST*
>EgrG_000081200.1
MNLLRIFAVVITCLIVVGFGYPTHPTFPSYQTAVWHSSANTGYRCRAGICGYRCSSPWCHGFESALYPQMAMQEMWGSGAHGRHAHSRTMTEFLMKASPEDLTMLIESTPNIDEVITSLDSEAIIILINKLPNLRRVMEKLKPQTKMHIVSKLCDKVGNAMEWAGARRNDGSGMWNEYGSVWEGIDAIQDLEAEMITRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPMPATLVSNMDPHVLRSIILNMPNLDDILMQVDPVHLQSALMYVPGFGTYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQWMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPKAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPARFANVLLHVKPDYVRYIIENHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPNVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAPTASVTEASMVTVDYAHLLRSKIPLIDNVIKMSDPAKVAILRDNLLDVGTTDENGVSSITVEKLQMVLLKIPLFDQFLNWIDSKKLHALLQKIPTLLEVIATANQEALDKINLLLHDAIATMNTAKKLIVTSICRKLAEEGKLRLPRVCPSTST*

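And so on.

I wrote a bash script that does this, and it works just as I want; it is quite simple. Now I'm trying to do the same in Python (which I'm learning), but I'm struggling to do it in a pythonic way. I tried this, but I only get the first pair and then it stops: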

rbh=open('rbh_res_eg-not-sec.txt', 'r')
ems=open('em_seq.fasta', 'r')
egs=open('eg_seq.fasta', 'r')

for l in rbh:
    emid=l.split('\t')[0]
    egid=l.split('\t')[1]
    # ids=emid+'\n'+egid 
    # print ids # just to check if split worked
    for lm in ems:      
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
    for lg in egs:
        if egid in lg:
            print lg.strip()
            print next(egs).strip()

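I tried some variations, but then I only got the IDs, without the sequences. So, how can I find an ID in the sequence file and print it together with the line after it (the line holding the sequence that the ID refers to)? Please let me know if I have explained this clearly.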

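1 answer:

Answer 0 (score: 1)

Iterating over a file moves the file pointer until it reaches the end of the file (the last line), so after the first iteration of your outer loop, both the ems and egs files are exhausted.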
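The exhaustion is easy to reproduce in isolation. The minimal sketch below uses an in-memory io.StringIO object as a stand-in for an open file (an assumption made only to keep the example self-contained, and the two records are made up); file objects returned by open() are exhausted by iteration in the same way.

```python
import io

# Stand-in for an open FASTA file (hypothetical two-record sample).
fasta = io.StringIO(u">id1\nAAAA\n>id2\nCCCC\n")

first_pass = [line.strip() for line in fasta]   # reads up to the end
second_pass = [line.strip() for line in fasta]  # pointer is already at the end

print(len(first_pass))   # number of lines seen by the first pass
print(len(second_pass))  # number of lines seen by the second pass
```

The second pass sees no lines at all, which is why only the first ID pair ever finds a match in the question's code.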

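The quick and dirty workaround is to reset the ems and egs file pointers to zero at the end of each pass through the outer loop, i.e.:

for line in rbh:
    # no need to split twice
    parts = line.split("\t")
    emid, egid = parts[0].strip(), parts[1].strip()
    for lm in ems:
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
    ems.seek(0) # reset the file pointer
    for lg in egs:
        if egid in lg:
            print lg.strip()
            print next(egs).strip()
    egs.seek(0) # reset the file pointer

Note that calling next(iterator) while already iterating over iterator consumes one item from the iterator, as shown here:

>>> it = iter(range(20))
>>> for x in it:
...     print x, next(it)
...
0 1
2 3
4 5
6 7
8 9
10 11
12 13
14 15
16 17
18 19

As you can see, the loop body does not run for every element of the range here. Given your file format it should not be a big problem, but I thought I would still warn you.

Now, your algorithm is quite inefficient: for each line of the rbh file, it scans the whole ems and egs files again and again.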

_NB: the following assumes that each emid / egid appears at most once in the fasta files._
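If that assumption is in doubt, it can be checked before relying on it. The sketch below is only an illustration: duplicated_ids is a hypothetical helper, the sample records are made up, and an io.StringIO object stands in for the real fasta file (pass the open file object instead).

```python
import io
from collections import Counter

def duplicated_ids(lines):
    """Return the header IDs that occur more than once in `lines`."""
    counts = Counter(
        line.strip().lstrip(">")
        for line in lines
        if line.startswith(">")
    )
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)

# Hypothetical sample in which the ID "a.1" appears twice.
sample = io.StringIO(u">a.1\nAAAA\n>b.1\nCCCC\n>a.1\nGGGG\n")
dups = duplicated_ids(sample)
print(dups)
```

An empty result means every ID is unique in that file.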

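If your ems and egs files are not too big and you have enough available memory, you could load them into a pair of dicts and do a plain dict lookup (which is O(1) and possibly one of the most optimized operations in Python):

# warning: totally untested code

def fastamap(path):
    d = dict()
    with open(path) as f:
        # note: num counts loop iterations, not physical lines, because
        # next(f) below consumes the sequence lines; it is approximate
        for num, line in enumerate(f, 1):
            line = line.strip()
            # skip empty lines.
            if not line:
                continue

            # sanity check: we should only see
            # lines starting with ">", the "value"
            # lines being consumed by the `next(f)` call
            if not line.startswith(">"):
                raise ValueError(
                    "in file %s: line %s doesn't start with '>'" % (
                      path, num
                    ))

            # ok, proceed: map the ID (without ">") to its sequence line
            d[line.lstrip(">")] = next(f).strip()

    return d

ems = fastamap('em_seq.fasta')
egs = fastamap('eg_seq.fasta')
with open('rbh_res_eg-not-sec.txt') as rbh:
    for line in rbh:
        parts = line.split("\t")
        emid, egid = parts[0].strip(), parts[1].strip()

        # the dict keys have no ">", so add it back for the header line
        if emid in ems:
            print ">" + emid
            print ems[emid]

        if egid in egs:
            print ">" + egid
            print egs[egid]

If this doesn't fly because of memory issues, then bad luck, you are stuck with a sequential scan (unless you want to use some database system, but that might be overkill). However, still assuming that each emid / egid appears only once in the fasta files, you can at least exit the inner loops as soon as you have found your target.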
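The early exit from the sequential scan can be sketched as follows. This is an illustration, not the answerer's original code: find_record is a hypothetical helper in which return plays the role of break, io.StringIO objects with made-up records stand in for the opened files, and each sequence is assumed to sit on the single line after its header. Unlike the substring test in the question, it compares the whole ID, so one ID cannot match a longer one that contains it.

```python
import io

def find_record(fasta, seq_id):
    """Scan `fasta` from the start; return (header, sequence) or None."""
    fasta.seek(0)  # rewind before every scan
    for line in fasta:
        if line.startswith(">") and line[1:].strip() == seq_id:
            # Stop at the first hit instead of reading the rest of the file.
            return line.strip(), next(fasta, "").strip()
    return None

# Made-up stand-ins for the two fasta files and the ID-pair file.
ems = io.StringIO(u">EmuJ_1.1\nAAAA\n>EmuJ_2.1\nCCCC\n")
egs = io.StringIO(u">EgrG_2.1\nGGGG\n>EgrG_1.1\nTTTT\n")
pairs = io.StringIO(u"EmuJ_2.1\tEgrG_2.1\t4\t4\nEmuJ_1.1\tEgrG_1.1\t4\t4\n")

output = []
for line in pairs:
    parts = line.split("\t")
    emid, egid = parts[0].strip(), parts[1].strip()
    for fasta, seq_id in ((ems, emid), (egs, egid)):
        record = find_record(fasta, seq_id)
        if record is not None:
            output.extend(record)

print("\n".join(output))
```

Each lookup is still a linear scan in the worst case, so this only trims the average cost; the dict approach remains the faster option when the files fit in memory.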