How do I parse an XML file with identically named child tags in Python?

Time: 2017-09-25 10:29:59

Tags: python xml parsing lxml

<?xml version="1.0"?>
<BioSampleSet>
  <BioSample accession="SAMN01347139" id="1347139" submission_date="2012-09-21T22:44:26.843" last_update="2012-09-21T22:44:26.843" publication_date="2012-09-21T22:44:26.843" access="controlled-access">
    <Ids>
      <Id is_primary="1" db="BioSample">SAMN01347139</Id>
      <Id db="dbGaP" is_hidden="1" db_label="Sample name">44-21834</Id>
    </Ids>
    <Description>
      <Title>DNA sample from a human male participant in the dbGaP study "Framingham SHARe Thyroid and Hormone Data"</Title>
      <Organism taxonomy_name="Homo sapiens" taxonomy_id="9606"/>
    </Description>
    <Owner>
      <Name abbreviation="NCBI"/>
    </Owner>
    <Models>
      <Model>Generic</Model>
    </Models>
    <Package display_name="Generic">Generic.1.0</Package>
    <Attributes>
      <Attribute display_name="gap accession" harmonized_name="gap_accession" attribute_name="gap_accession">phs000044</Attribute>
      <Attribute display_name="submitter handle" harmonized_name="submitter_handle" attribute_name="submitter handle">Framingham_SHARe</Attribute>
      <Attribute display_name="biospecimen repository" harmonized_name="biospecimen_repository" attribute_name="biospecimen repository">Framingham_SHARe</Attribute>
      <Attribute display_name="study name" harmonized_name="study_name" attribute_name="study name">Framingham SHARe Thyroid and Hormone Data</Attribute>
      <Attribute display_name="biospecimen repository sample id" harmonized_name="biospecimen_repository_sample_id" attribute_name="biospecimen repository sample id">21834</Attribute>
      <Attribute display_name="submitted sample id" harmonized_name="submitted_sample_id" attribute_name="submitted sample id">21834</Attribute>
      <Attribute display_name="submitted subject id" harmonized_name="submitted_subject_id" attribute_name="submitted subject id">21834</Attribute>
      <Attribute display_name="gap sample id" harmonized_name="gap_sample_id" attribute_name="gap_sample_id">105542</Attribute>
      <Attribute display_name="gap subject id" harmonized_name="gap_subject_id" attribute_name="gap_subject_id">28577</Attribute>
      <Attribute display_name="sex" harmonized_name="sex" attribute_name="sex">male</Attribute>
      <Attribute display_name="analyte type" harmonized_name="analyte_type" attribute_name="analyte type">DNA</Attribute>
      <Attribute display_name="subject is affected" harmonized_name="subject_is_affected" attribute_name="subject is affected"/>
      <Attribute display_name="gap consent code" harmonized_name="gap_consent_code" attribute_name="gap_consent_code">1</Attribute>
      <Attribute display_name="gap consent short name" harmonized_name="gap_consent_short_name" attribute_name="gap_consent_short_name">GRU</Attribute>
    </Attributes>
    <Status when="2012-09-21T22:44:26.843" status="suppressed"/>
  </BioSample>
</BioSampleSet>

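I want to parse the XML file shown above programmatically. I tried lxml, but I am having trouble extracting the keys and values inside the <Attributes> tag, because every child tag is named Attribute. I also tried splitting the text with "Attribute" as a regular expression, but since the whole file is a single line, the result is just a list of the individual characters of that section. I am using Python, and the number of <Attribute> tags can vary from record to record. Does anyone have any suggestions? I am currently using the following code: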

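from lxml import objectify
import Bio.Entrez as Entrez

# sra_id is assumed to be defined earlier (not shown in the question)
meta_data = Entrez.efetch(db="biosample", id=sra_id, rettype="runinfo").read()

tree = objectify.fromstring(meta_data)
# Fails: <Attributes> has no child element named submitter_handle
print(tree.BioSample.Attributes.submitter_handle)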
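The lookup above fails because submitter_handle is the value of the harmonized_name XML attribute, not the name of a child element. One possible approach, sketched here rather than offered as a definitive fix, is to iterate over every <Attribute> element and build a dictionary keyed by one of its XML attributes. The sketch uses the standard library's xml.etree.ElementTree instead of lxml.objectify (lxml.etree provides the same fromstring(), iter() and get() calls); the helper name attributes_to_dict and the trimmed SAMPLE_XML string are illustrative assumptions, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the BioSample record from the question (illustrative input).
SAMPLE_XML = """<?xml version="1.0"?>
<BioSampleSet>
  <BioSample accession="SAMN01347139">
    <Attributes>
      <Attribute display_name="gap accession" harmonized_name="gap_accession" attribute_name="gap_accession">phs000044</Attribute>
      <Attribute display_name="submitter handle" harmonized_name="submitter_handle" attribute_name="submitter handle">Framingham_SHARe</Attribute>
      <Attribute display_name="subject is affected" harmonized_name="subject_is_affected" attribute_name="subject is affected"/>
    </Attributes>
  </BioSample>
</BioSampleSet>"""


def attributes_to_dict(xml_text, key="harmonized_name"):
    """Map each <Attribute>'s `key` XML attribute to its text (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    result = {}
    # iter() visits every <Attribute> element, however many there are.
    for elem in root.iter("Attribute"):
        name = elem.get(key)
        if name is None:
            continue  # skip elements that lack the chosen key attribute
        # Empty elements such as <Attribute .../> have text None.
        result[name] = elem.text
    return result


attrs = attributes_to_dict(SAMPLE_XML)
print(attrs["submitter_handle"])
```

Because the loop does not depend on how many <Attribute> tags are present, it copes with records of varying length. If a response contains several <BioSample> records, calling iter("Attribute") on each <BioSample> element separately would keep their attributes from being merged into one dictionary.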

0 answers:

No answers