Parsing a .txt file to create a tab-delimited output file

Time: 2018-05-14 12:28:43

Tags: python csv

macOS, Python 2.7

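I am trying to parse a .txt file and extract the strings I need to build a tab-delimited table. I will have to do this for many files, but I am having trouble selecting some of the strings.

Here is a sample input file: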

# Assembly name:  ASM1844v1
# Organism name:  Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name:  strain=ACICU
# Taxid:          405416
# BioSample:      SAMN02603140
# BioProject:     PRJNA17827
# Submitter:      CNR - National Research Council
# Date:           2008-4-15
# Assembly type:  n/a
# Release type:   major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession   RefSeq Unit Accession   Assembly-Unit name
## GCA_000018455.1  GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role   Assigned-Molecule   Assigned-Molecule-Location/Type GenBank-Accn    Relationship    RefSeq-Accn Assembly-Unit   Sequence-Length UCSC-style-name
ANONYMOUS   assembled-molecule  na  Chromosome CP000863.1  =   NC_010611.1 Primary Assembly    3904116 na
pACICU1 assembled-molecule  pACICU1 Plasmid CP000864.1  =   NC_010605.1 Primary Assembly    28279   na
pACICU2 assembled-molecule  pACICU2 Plasmid CP000865.1  =   NC_010606.1 Primary Assembly    64366   na

So far my code looks like this, with Headstring holding the column headers:

# Open the input file for reading 
InFile = open(InFileName, 'r')
#f = open(InFileName, 'r')

# Write the header
Headstring= "GenBank_Assembly_ID    RefSeq_Assembly_ID  Assembly_level  Chromosome Plasmid  Refseq_chromosome   Refseq_plasmid1 Refseq_plasmid2 Refseq_plasmid3 Refseq_plasmid4 Refseq_plasmid5"

# Set up chromosome and plasmid count
ccount = 0
pcount = 0

# Look for corresponding data from each file
with open(InFileName, 'r') as searchfile:
    for line in searchfile:
        if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
            GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
            print GCA.group(1)
            GCA = GCA.group(1)
        if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
            GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
            print GCF.group(1)
            GCF = GCF.group(1) 
        if re.search ( r'level: (.+$)', line, re.M|re.I):
            assembly = re.search( r'level: (.+$)', line, re.M|re.I)
            print assembly.group(1)
            assembly = assembly.group(1)
        if "Chromosome" in line:
            ccount += 1
            print ccount
        if "Plasmid" in line:
            pcount += 1
            print pcount



OutputString = "%s\t%s\t%s\t%s\t%s\t" % (GCA, GCF, assembly, ccount, pcount)


OutFile=open(OutFileName, 'w')
OutFile.write(Headstring+'\n'+OutputString)


InFile.close()
OutFile.close()

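The main problem I am having is that I want to extract the strings NC_010611.1, NC_010605.1 and NC_010606.1 and put them on the same line, separated by tabs, so that they end up under the headers Refseq_chromosome, Refseq_plasmid1 and Refseq_plasmid2 respectively. However, I only want the script to search for them if assembly is "Chromosome" or "Complete Genome". I am not sure how to search for the strings only when this condition holds.

I know the regular expression to get these strings could be '=\t(\w+..)', but that is as far as I have got.

I am very new to Python, so an explanation would be great. Thanks in advance!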

2 answers:

Answer 0: (score: 3)

Take a look at this example:

import re

InFileName  = 'YOUR_INPUT_FILE_NAME'
OutFileName = 'YOUR_OUTPUT_FILE_NAME'

# Write the header
Headstring= "GenBank_Assembly_ID\tRefSeq_Assembly_ID\tAssembly_level\tChromosome\tPlasmid\tRefseq_chromosome\tRefseq_plasmid1\tRefseq_plasmid2\tRefseq_plasmid3\tRefseq_plasmid4\tRefseq_plasmid5"

# Look for corresponding data from each file
with open(InFileName, 'r') as InFile, open(OutFileName, 'w') as OutFile:
    chromosomes = []
    plasmids = []
    for line in InFile:
        if line.lstrip()[0] == '#':
            # Process header part of the file differently from the data part
            if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
                GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
                print GCA.group(1)
                GCA = GCA.group(1)
            if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
                GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
                print GCF.group(1)
                GCF = GCF.group(1)
            if re.search ( r'level: (.+$)', line, re.M|re.I):
                assembly = re.search( r'level: (.+$)', line, re.M|re.I)
                print assembly.group(1)
                assembly = assembly.group(1)
        elif assembly in ['Chromosome', 'Complete Genome']:
            # Process each data line separately
            split_line = line.split()
            Type = split_line[3]
            RefSeq_Accn = split_line[6]
            if Type == "Chromosome":
                chromosomes.append(RefSeq_Accn)
            if Type == "Plasmid":
                plasmids.append(RefSeq_Accn)

    # Merge names of up to N chromosomes
    N = 1
    cstr = ''
    for i in range(N):
        if i < len(chromosomes):
            nextChromosome = chromosomes[i]
        else:
            nextChromosome = ''
        cstr += '\t' + nextChromosome

    # Merge names of up to M plasmids
    M = 5
    pstr = ''
    for i in range(M):
        if i < len(plasmids):
            nextPlasmid = plasmids[i]
        else:
            nextPlasmid = ''
        pstr += '\t' + nextPlasmid

    OutputString = "%s\t%s\t%s\t%s\t%s" % (GCA, GCF, assembly, len(chromosomes), len(plasmids))
    OutputString += cstr
    OutputString += pstr

    OutFile.write(Headstring+'\n'+OutputString)

Input:

# Assembly name:  ASM1844v1
# Organism name:  Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name:  strain=ACICU
# Taxid:          405416
# BioSample:      SAMN02603140
# BioProject:     PRJNA17827
# Submitter:      CNR - National Research Council
# Date:           2008-4-15
# Assembly type:  n/a
# Release type:   major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession   RefSeq Unit Accession   Assembly-Unit name
## GCA_000018455.1  GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role   Assigned-Molecule   Assigned-Molecule-Location/Type GenBank-Accn     Relationship    RefSeq-Accn Assembly-Unit   Sequence-Length UCSC-style-name
ANONYMOUS   assembled-molecule  na  Chromosome CP000863.1  =   NC_010611.1 Primary Assembly    3904116 na
pACICU1 assembled-molecule  pACICU1 Plasmid CP000864.1  =   NC_010605.1 Primary Assembly    28279   na
pACICU2 assembled-molecule  pACICU2 Plasmid CP000865.1  =   NC_010606.1 Primary Assembly    64366   na

Output:

GenBank_Assembly_ID  RefSeq_Assembly_ID      Assembly_level  Chromosome  Plasmid Refseq_chromosome  Refseq_plasmid1 Refseq_plasmid2  Refseq_plasmid3 Refseq_plasmid4  Refseq_plasmid5
GCA_000018445.1      GCF_000018445.1         Complete Genome 1           2       NC_010611.1        NC_010605.1     NC_010606.1
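
A row like the one above can also be assembled without the two padding loops, by padding each list with empty strings and joining everything with tabs. This is only a sketch: the helper names pad and build_row are made up for illustration and are not part of the script above, and the defaults of 1 chromosome column and 5 plasmid columns simply mirror N and M.

```python
def pad(values, width):
    """Return exactly `width` items: truncate, or fill with empty strings."""
    return (list(values) + [''] * width)[:width]


def build_row(gca, gcf, level, chromosomes, plasmids, n_chrom=1, n_plasmid=5):
    """Build one tab-delimited output row (hypothetical helper)."""
    fields = [gca, gcf, level, str(len(chromosomes)), str(len(plasmids))]
    fields += pad(chromosomes, n_chrom) + pad(plasmids, n_plasmid)
    return '\t'.join(fields)


row = build_row('GCA_000018445.1', 'GCF_000018445.1', 'Complete Genome',
                ['NC_010611.1'], ['NC_010605.1', 'NC_010606.1'])
```

Because every row always has the same number of fields, the columns stay aligned with the header even when a file has fewer than five plasmids.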

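Main differences from your script:

  • I use the condition if line.lstrip()[0] == '#' to handle the "header" lines (those starting with a hash character) differently from the "table rows" at the bottom (the lines that actually contain the data for each sequence).
  • I use the condition if assembly in ['Chromosome', 'Complete Genome'], which is the condition you specified in the question.
  • I split each table row into values with split_line = line.split(). After that, Type = split_line[3] gives the type (the fourth column of the table data) and RefSeq_Accn = split_line[6] gives the seventh column.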
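
The three points above can be condensed into one function. The following is a hedged sketch rather than a drop-in replacement: parse_report and the info dictionary are names invented here. It differs from the script in two ways. It uses startswith('#') after skipping blank lines, since indexing [0] into an empty string raises IndexError. It also skips rows with fewer than seven fields instead of failing on them.

```python
import re


def parse_report(lines):
    """Collect accessions, assembly level and RefSeq IDs from one report."""
    info = {'GCA': '', 'GCF': '', 'level': '',
            'chromosomes': [], 'plasmids': []}
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank line: nothing to parse
        if stripped.startswith('#'):
            # Header part: same patterns as in the script above
            for key, pattern in (('GCA', r': (GCA_[\d.]+)'),
                                 ('GCF', r': (GCF_[\d.]+)'),
                                 ('level', r'level: (.+)$')):
                match = re.search(pattern, stripped)
                if match:
                    info[key] = match.group(1)
        elif info['level'] in ('Chromosome', 'Complete Genome'):
            fields = stripped.split()
            if len(fields) < 7:
                continue  # incomplete row: skip instead of raising IndexError
            if fields[3] == 'Chromosome':
                info['chromosomes'].append(fields[6])
            elif fields[3] == 'Plasmid':
                info['plasmids'].append(fields[6])
    return info
```

It could be called as parse_report(open(InFileName)) for each input file. Note that split() breaks on any whitespace, so "Primary Assembly" becomes two fields; that does not matter here because only the fourth and seventh fields are used, but a sequence name containing a space would shift the columns.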

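Answer 1: (score: 0)

You could first read all of the data into a pandas DataFrame and work from there. You can then process one column (the one containing 'NC_010611.1') conditionally on another column. See the example here: Pandas conditional creation of a series/dataframe column

It may be possible to get what you want in a single pass over the data, but the code will probably be easier to write and read if you make two passes over it.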
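
As a rough illustration of the pandas idea, the table part of the file could be loaded by treating every line that starts with '#' as a comment, and the RefSeq column could then be filtered on the type column. This sketch assumes that the real file is tab-separated and that pandas is installed. The column names are copied from the header line of the sample input, and read_table and accessions are hypothetical helper names.

```python
import io

import pandas as pd

# Column names taken from the "# Sequence-Name ..." line of the sample input
COLUMNS = ['Sequence-Name', 'Sequence-Role', 'Assigned-Molecule',
           'Assigned-Molecule-Location/Type', 'GenBank-Accn', 'Relationship',
           'RefSeq-Accn', 'Assembly-Unit', 'Sequence-Length',
           'UCSC-style-name']


def read_table(handle):
    """Load only the data rows; lines starting with '#' are ignored."""
    return pd.read_csv(handle, sep='\t', comment='#', header=None,
                       names=COLUMNS, keep_default_na=False)


def accessions(table, molecule_type):
    """RefSeq accessions of rows whose type column equals molecule_type."""
    mask = table['Assigned-Molecule-Location/Type'] == molecule_type
    return table.loc[mask, 'RefSeq-Accn'].tolist()


sample = (u"# Assembly level: Complete Genome\n"
          u"ANONYMOUS\tassembled-molecule\tna\tChromosome\tCP000863.1\t=\t"
          u"NC_010611.1\tPrimary Assembly\t3904116\tna\n"
          u"pACICU1\tassembled-molecule\tpACICU1\tPlasmid\tCP000864.1\t=\t"
          u"NC_010605.1\tPrimary Assembly\t28279\tna\n"
          u"pACICU2\tassembled-molecule\tpACICU2\tPlasmid\tCP000865.1\t=\t"
          u"NC_010606.1\tPrimary Assembly\t64366\tna\n")
table = read_table(io.StringIO(sample))
chromosomes = accessions(table, 'Chromosome')
plasmids = accessions(table, 'Plasmid')
```

The header values (GCA, GCF, assembly level) would still need a separate pass over the '#' lines, which is the two-pass structure suggested above.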