Question

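macOS, Python 2.7

I am trying to parse .txt files and extract specific strings in order to build a tab-delimited table. I will need to do this for many files, but I am having trouble selecting some of the strings.

Here is a sample input file: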

# Assembly name:  ASM1844v1
# Organism name:  Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name:  strain=ACICU
# Taxid:          405416
# BioSample:      SAMN02603140
# BioProject:     PRJNA17827
# Submitter:      CNR - National Research Council
# Date:           2008-4-15
# Assembly type:  n/a
# Release type:   major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession   RefSeq Unit Accession   Assembly-Unit name
## GCA_000018455.1  GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role   Assigned-Molecule   Assigned-Molecule-Location/Type GenBank-Accn    Relationship    RefSeq-Accn Assembly-Unit   Sequence-Length UCSC-style-name
ANONYMOUS   assembled-molecule  na  Chromosome CP000863.1  =   NC_010611.1 Primary Assembly    3904116 na
pACICU1 assembled-molecule  pACICU1 Plasmid CP000864.1  =   NC_010605.1 Primary Assembly    28279   na
pACICU2 assembled-molecule  pACICU2 Plasmid CP000865.1  =   NC_010606.1 Primary Assembly    64366   na

My code so far is shown below; Headstring holds the column headers:

# Open the input file for reading 
InFile = open(InFileName, 'r')
#f = open(InFileName, 'r')

# Write the header
Headstring= "GenBank_Assembly_ID    RefSeq_Assembly_ID  Assembly_level  Chromosome Plasmid  Refseq_chromosome   Refseq_plasmid1 Refseq_plasmid2 Refseq_plasmid3 Refseq_plasmid4 Refseq_plasmid5"

# Set up chromosome and plasmid count
ccount = 0
pcount = 0

# Look for corresponding data from each file
with open(InFileName, 'r') as searchfile:
    for line in searchfile:
        if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
            GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
            print GCA.group(1)
            GCA = GCA.group(1)
        if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
            GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
            print GCF.group(1)
            GCF = GCF.group(1) 
        if re.search ( r'level: (.+$)', line, re.M|re.I):
            assembly = re.search( r'level: (.+$)', line, re.M|re.I)
            print assembly.group(1)
            assembly = assembly.group(1)
        if "Chromosome" in line:
            ccount += 1
            print ccount
        if "Plasmid" in line:
            pcount += 1
            print pcount



OutputString = "%s\t%s\t%s\t%s\t%s\t" % (GCA, GCF, assembly, ccount, pcount)


OutFile=open(OutFileName, 'w')
OutFile.write(Headstring+'\n'+OutputString)


InFile.close()
OutFile.close()

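My main problem is that I want to extract the strings NC_010611.1, NC_010605.1 and NC_010606.1 and write them on the same line, separated by tabs, so that they end up under the headers Refseq_chromosome, Refseq_plasmid1 and Refseq_plasmid2 respectively. However, I only want the script to search for these if the assembly level is "Chromosome" or "Complete Genome". I am not sure how to search for the strings only when this condition holds.

I know the regular expression to capture these strings could be something like '=\t(\w+..)', but that is as far as I have got.

I am very new to Python, so an explanation would be great. Thanks in advance!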

Answer 1

Take a look at this example:

import re

InFileName  = 'YOUR_INPUT_FILE_NAME'
OutFileName = 'YOUR_OUTPUT_FILE_NAME'

# Write the header
Headstring= "GenBank_Assembly_ID\tRefSeq_Assembly_ID\tAssembly_level\tChromosome\tPlasmid\tRefseq_chromosome\tRefseq_plasmid1\tRefseq_plasmid2\tRefseq_plasmid3\tRefseq_plasmid4\tRefseq_plasmid5"

# Look for corresponding data from each file
with open(InFileName, 'r') as InFile, open(OutFileName, 'w') as OutFile:
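    chromosomes = []
    plasmids = []
    # Defaults, in case a header field is missing from the file
    GCA = GCF = assembly = ''
    for line in InFile:
        if not line.strip():
            continue  # skip blank lines; line.lstrip()[0] would raise IndexError on them
        if line.lstrip()[0] == '#':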
            # Process header part of the file differently from the data part
            if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
                GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
                print GCA.group(1)
                GCA = GCA.group(1)
            if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
                GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
                print GCF.group(1)
                GCF = GCF.group(1)
            if re.search ( r'level: (.+$)', line, re.M|re.I):
                assembly = re.search( r'level: (.+$)', line, re.M|re.I)
                print assembly.group(1)
                assembly = assembly.group(1)
        elif assembly in ['Chromosome', 'Complete Genome']:
            # Process each data line separately
            split_line = line.split()
            Type = split_line[3]
            RefSeq_Accn = split_line[6]
            if Type == "Chromosome":
                chromosomes.append(RefSeq_Accn)
            if Type == "Plasmid":
                plasmids.append(RefSeq_Accn)

    # Merge names of up to N chromosomes
    N = 1
    cstr = ''
    for i in range(N):
        if i < len(chromosomes):
            nextChromosome = chromosomes[i]
        else:
            nextChromosome = ''
        cstr += '\t' + nextChromosome

    # Merge names of up to M plasmids
    M = 5
    pstr = ''
    for i in range(M):
        if i < len(plasmids):
            nextPlasmid = plasmids[i]
        else:
            nextPlasmid = ''
        pstr += '\t' + nextPlasmid

    OutputString = "%s\t%s\t%s\t%s\t%s" % (GCA, GCF, assembly, len(chromosomes), len(plasmids))
    OutputString += cstr
    OutputString += pstr

    OutFile.write(Headstring+'\n'+OutputString)
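
As an aside, the two padding loops above can be written more compactly with list slicing. This is only a sketch: `pad_fields` is a hypothetical helper name, not part of the script above, and the sample accessions are the ones from the question.

```python
def pad_fields(values, width):
    """Return exactly `width` items: `values` truncated or padded with ''."""
    return (list(values) + [''] * width)[:width]

chromosomes = ['NC_010611.1']
plasmids = ['NC_010605.1', 'NC_010606.1']

# One chromosome column followed by five plasmid columns, joined by tabs
fields = pad_fields(chromosomes, 1) + pad_fields(plasmids, 5)
row_tail = '\t'.join(fields)
```

Appending `'\t' + row_tail` to the first five columns should give the same line as `OutputString + cstr + pstr` in the script.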

Input:

# Assembly name:  ASM1844v1
# Organism name:  Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name:  strain=ACICU
# Taxid:          405416
# BioSample:      SAMN02603140
# BioProject:     PRJNA17827
# Submitter:      CNR - National Research Council
# Date:           2008-4-15
# Assembly type:  n/a
# Release type:   major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession   RefSeq Unit Accession   Assembly-Unit name
## GCA_000018455.1  GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role   Assigned-Molecule   Assigned-Molecule-Location/Type GenBank-Accn     Relationship    RefSeq-Accn Assembly-Unit   Sequence-Length UCSC-style-name
ANONYMOUS   assembled-molecule  na  Chromosome CP000863.1  =   NC_010611.1 Primary Assembly    3904116 na
pACICU1 assembled-molecule  pACICU1 Plasmid CP000864.1  =   NC_010605.1 Primary Assembly    28279   na
pACICU2 assembled-molecule  pACICU2 Plasmid CP000865.1  =   NC_010606.1 Primary Assembly    64366   na

Output:

GenBank_Assembly_ID  RefSeq_Assembly_ID      Assembly_level  Chromosome  Plasmid Refseq_chromosome  Refseq_plasmid1 Refseq_plasmid2  Refseq_plasmid3 Refseq_plasmid4  Refseq_plasmid5
GCA_000018445.1      GCF_000018445.1         Complete Genome 1           2       NC_010611.1        NC_010605.1     NC_010606.1

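The main differences from your script:

I use the condition if line.lstrip()[0] == '#' to handle the "header" lines (those starting with a hash character) differently from the "table rows" at the bottom (the lines that actually contain the data for each sequence).
I use the condition if assembly in ['Chromosome', 'Complete Genome'], which is the condition you specified in your question.
I split each table row into values with split_line = line.split(). I then take the type with Type = split_line[3] (the fourth column of the table), and RefSeq_Accn = split_line[6] gives me the seventh column.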
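
The positional indexes split_line[3] and split_line[6] work because no field before the seventh column contains a space. If the report is tab-delimited (an assumption here, since the sample above is displayed with spaces), splitting on tabs and naming the columns may be more robust. A minimal sketch, where `COLUMNS` and `parse_row` are hypothetical names introduced for illustration:

```python
# Column names copied from the "# Sequence-Name ..." header line of the report
COLUMNS = [
    'Sequence-Name', 'Sequence-Role', 'Assigned-Molecule',
    'Assigned-Molecule-Location/Type', 'GenBank-Accn', 'Relationship',
    'RefSeq-Accn', 'Assembly-Unit', 'Sequence-Length', 'UCSC-style-name',
]

def parse_row(line):
    """Split one tab-delimited data line into a dict keyed by column name."""
    values = line.rstrip('\n').split('\t')
    return dict(zip(COLUMNS, values))

# Sample row, assuming tabs between fields (an assumption about the real file)
sample = '\t'.join([
    'pACICU1', 'assembled-molecule', 'pACICU1', 'Plasmid', 'CP000864.1',
    '=', 'NC_010605.1', 'Primary Assembly', '28279', 'na',
]) + '\n'

row = parse_row(sample)
molecule_type = row['Assigned-Molecule-Location/Type']
refseq_accn = row['RefSeq-Accn']
```

Splitting on tabs keeps "Primary Assembly" as a single value, whereas a plain line.split() would break it into two.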

Answer 2

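You could first read all the data into a pandas DataFrame and work from there. You can then process one column conditionally on another column (whichever one contains 'NC_010611.1'). See the example here: Pandas conditional creation of a series/dataframe column.

It may be possible to get what you want in a single pass over the data, but it will probably be easier to write and read if you make two passes.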
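
To illustrate the two-pass idea, here is a rough sketch in plain Python rather than pandas. The function names `read_header` and `read_accessions` are hypothetical, and the short `lines` list stands in for the lines of a real report file (for example, the result of calling readlines() on the open file).

```python
def read_header(lines):
    """Pass 1: collect '# key: value' header lines into a dict."""
    header = {}
    for line in lines:
        if line.startswith('# ') and ':' in line:
            key, value = line[2:].split(':', 1)
            header[key.strip()] = value.strip()
    return header

def read_accessions(lines):
    """Pass 2: map each molecule type to its list of RefSeq accessions."""
    found = {'Chromosome': [], 'Plasmid': []}
    for line in lines:
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and header lines
        cols = line.split()
        if len(cols) > 6 and cols[3] in found:
            found[cols[3]].append(cols[6])
    return found

# Abbreviated stand-in for the report shown in the question
lines = [
    '# Assembly level: Complete Genome\n',
    '# GenBank assembly accession: GCA_000018445.1\n',
    '# RefSeq assembly accession: GCF_000018445.1\n',
    '#\n',
    'ANONYMOUS\tassembled-molecule\tna\tChromosome\tCP000863.1\t=\tNC_010611.1\tPrimary Assembly\t3904116\tna\n',
    'pACICU1\tassembled-molecule\tpACICU1\tPlasmid\tCP000864.1\t=\tNC_010605.1\tPrimary Assembly\t28279\tna\n',
    'pACICU2\tassembled-molecule\tpACICU2\tPlasmid\tCP000865.1\t=\tNC_010606.1\tPrimary Assembly\t64366\tna\n',
]

header = read_header(lines)
level = header.get('Assembly level', '')

# Only look for RefSeq accessions when the assembly level qualifies
accessions = {'Chromosome': [], 'Plasmid': []}
if level in ('Chromosome', 'Complete Genome'):
    accessions = read_accessions(lines)
```

Because the assembly level is known before the second pass starts, the condition from the question is checked once instead of on every data line.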

Parsing a .txt file to create a tab-delimited output file
