Removing XML parent elements based on a condition of their child elements - Python

Time: 2021-05-13 20:03:03

Tags: python xml parsing automation metadata

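I am trying to remove parent XML elements based on the text of specific child elements that contain a "nan" value. The input XML uses namespaces, which makes this trickier than expected. I can remove selected child elements on their own, but not the associated/adjacent parent elements. So far I can only remove the "nan" values associated with gam:String elements, but I want to remove every child element with a "nan" text value together with its associated parent element.

Below are the script I am using and the input and (desired) output XML. Any help is much appreciated!

Script: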

from lxml import etree
import os 

path = "C:\\users\\mdl518\\Desktop\\"

### Removing "Nan" Values
tree = etree.parse(os.path.join(path,"metadata_info.xml"))

for elem in tree.findall('.//{http://standards.iso.org/iso/19115/-3/gam/1.0}String'):
    if elem.text == 'nan':
        parent = elem.getparent()
        parent.remove(elem)  # removes only the gam:String child, not its parent

with open(".//metadata_output.xml", "wb") as f:
    f.write(etree.tostring(tree, xml_declaration=True, encoding='utf-8'))  # write the modified tree

Input XML:

<?xml version='1.0' encoding='utf-8'?>
<nas:metadata xmlns:nas="http://www.arcgis.com/schema/nas/base"   
xmlns:mcc="http://standards.org/iso/19115/-3/mcc/1.0"    
xmlns:mdl="http://standards.org/iso/19115/-3/mdl/1.0" 
xmlns:mnl="http://standards.org/iso/19115/-3/mnl/1.0"
xmlns:lan="http://standards.org/iso/19115/-3/lan/1.0"
xmlns:lis="http://standards.org/iso/19115/-3/lis/1.0"
xmlns:gam="http://standards.org/iso/19115/-3/gam/1.0">
  <mdl:metadataIdentifier>
    <mcc:MD_Identifier>
      <mnl:name>
        <mnl:type>
          <gam:String>The Metadata File</gam:String>
        </mnl:type>
        <mnl:description>
          <mcc:listing codeList="http://arcgis.com/codelist/ScopeCode" codeListValue="dataset"></mcc:listing>
        </mnl:description>
      </mnl:name>
      <mnl:address>
        <mnl:defaultLocale>
          <lan:location>nan</lan:location>
        </mnl:defaultLocale>
      </mnl:address>
      <lan:language>
        <lan:type>
          <lis:name>English</lis:name>
        </lan:type>
       </lan:language>
     </mcc:MD_Identifier>
     <mcc:contactInfo>
       <mdl:POC>
         <mnl:name>
           <lis:person>Tom</lis:person>
         </mnl:name>
         <mnl:age>
           <gam:String>nan</gam:String>
         </mnl:age>
         <mnl:status>
           <lis:employment>nan</lis:employment>
         </mnl:status>
       </mdl:POC>
     </mcc:contactInfo>
   </mdl:metadataIdentifier>
 </nas:metadata>

Output XML:

<?xml version='1.0' encoding='utf-8'?>
<nas:metadata xmlns:nas="http://www.arcgis.com/schema/nas/base"   
xmlns:mcc="http://standards.org/iso/19115/-3/mcc/1.0"    
xmlns:mdl="http://standards.org/iso/19115/-3/mdl/1.0" 
xmlns:mnl="http://standards.org/iso/19115/-3/mnl/1.0"
xmlns:lan="http://standards.org/iso/19115/-3/lan/1.0"
xmlns:lis="http://standards.org/iso/19115/-3/lis/1.0"
xmlns:gam="http://standards.org/iso/19115/-3/gam/1.0">
  <mdl:metadataIdentifier>
    <mcc:MD_Identifier>
      <mnl:name>
        <mnl:type>
          <gam:String>The Metadata File</gam:String>
        </mnl:type>
        <mnl:description>
          <mcc:listing codeList="http://arcgis.com/codelist/ScopeCode" codeListValue="dataset"></mcc:listing>
        </mnl:description>
      </mnl:name>
      <lan:language>
        <lan:type>
          <lis:name>English</lis:name>
        </lan:type>
       </lan:language>
     </mcc:MD_Identifier>
     <mcc:contactInfo>
       <mdl:POC>
         <mnl:name>
           <lis:person>Tom</lis:person>
         </mnl:name>
       </mdl:POC>
     </mcc:contactInfo>
   </mdl:metadataIdentifier>
 </nas:metadata>

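1 Answer:

Answer 0 (score: 1)

This has to be done in two stages: first remove all nodes whose text is nan, then go through the empty nodes created by the first step and remove those as well: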

# tree is the document parsed in the question: tree = etree.parse(...)

# Step 1: remove every element whose string value is exactly "nan"
for n in tree.xpath('//*[.="nan"]'):
    n.getparent().remove(n)

# Step 2: select the elements left without any text content and remove them as well
empty = tree.xpath('//*[not(normalize-space())]')

for emp in empty:
    # Step 1 leaves one nested empty node; both levels are selected here,
    # so the removal is guarded with try/except
    try:
        emp.getparent().remove(emp)
    except (AttributeError, ValueError):
        continue
print(etree.tostring(tree).decode())

This should give you the desired output.
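
One caveat about step 2: the XPath `//*[not(normalize-space())]` matches every element that has no text content, including elements that carry only attributes. If such elements need to survive, a single recursive pass is one alternative. The following is a minimal sketch, not the answer's method: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra packages, and the function name `prune_nan`, the sample document, and the `urn:example:*` namespaces are all made up for illustration. It removes leaf elements whose text is "nan" and then only those parents that end up empty because of that removal.

```python
import xml.etree.ElementTree as ET


def prune_nan(elem, marker="nan"):
    """Prune descendants of elem; return True if elem itself should be removed.

    A leaf is removed when its stripped text equals marker. A parent is
    removed only if pruning took away all of its children and it has no
    text and no attributes of its own. Elements that were empty to begin
    with are left alone.
    """
    had_children = len(elem) > 0
    for child in list(elem):  # copy the child list, since we mutate elem
        if prune_nan(child, marker):
            elem.remove(child)

    text = (elem.text or "").strip()
    if len(elem) == 0:
        if text == marker:
            return True
        if had_children and not text and not elem.attrib:
            return True
    return False


# Hypothetical sample data, much smaller than the XML in the question.
SAMPLE = """<root xmlns:a="urn:example:a" xmlns:b="urn:example:b">
  <a:name><b:person>Tom</b:person></a:name>
  <a:age><b:String>nan</b:String></a:age>
  <a:address><a:locale><b:location>nan</b:location></a:locale></a:address>
  <a:listing code="dataset"/>
</root>"""

root = ET.fromstring(SAMPLE)
prune_nan(root)
print([child.tag for child in root])
```

With this sample, `a:age` and `a:address` are dropped together with their "nan" descendants, while `a:name` and the attribute-only `a:listing` remain. Because the function never looks at tag names, it works the same way for every namespace in the document.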
