Getting all child nodes of a node using xml.etree.ElementTree

Date: 2018-09-10 13:26:46

Tags: xml python-3.x xpath

Hi all, I am trying to parse some data from an XML file using Python 3. Here is my XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Created on Fri Sep 07 08:20:37 WAT 2018 with ROAMSMART IREG-360 // www.roam-smart.com -->
<tadig-raex-21:TADIGRAEXIR21 xmlns:tadig-raex-21="https://infocentre.gsm.org/TADIG-RAEX-IR21" xmlns:ns2="https://infocentre.gsm.org/TADIG-GEN">
    <tadig-raex-21:RAEXIR21FileHeader>
        <tadig-raex-21:FileCreationTimestamp>2018-01-08T15:42:21+01:00</tadig-raex-21:FileCreationTimestamp>
        <tadig-raex-21:FileType>IR.21</tadig-raex-21:FileType>
        <tadig-raex-21:SenderTADIG>DEMO</tadig-raex-21:SenderTADIG>
        <tadig-raex-21:PublishComment>Update</tadig-raex-21:PublishComment>
        <tadig-raex-21:TADIGGenSchemaVersion>2.4</tadig-raex-21:TADIGGenSchemaVersion>
        <tadig-raex-21:TADIGRAEXIR21SchemaVersion>10.1</tadig-raex-21:TADIGRAEXIR21SchemaVersion>
    </tadig-raex-21:RAEXIR21FileHeader>
    <tadig-raex-21:OrganisationInfo>
        <tadig-raex-21:OrganisationName>DEMO</tadig-raex-21:OrganisationName>
        <tadig-raex-21:CountryInitials>FRA</tadig-raex-21:CountryInitials>
        <tadig-raex-21:NetworkList>
            <tadig-raex-21:Network>
                <tadig-raex-21:TADIGCode>DEMO</tadig-raex-21:TADIGCode>
                <tadig-raex-21:NetworkType>Terrestrial</tadig-raex-21:NetworkType>
                <tadig-raex-21:NetworkData>
                    <tadig-raex-21:IPRoaming_IW_InfoSection>
                        <tadig-raex-21:IPRoaming_IW_Info_General>
                            <tadig-raex-21:EffectiveDateOfChange>2013-07-01</tadig-raex-21:EffectiveDateOfChange>
                            <tadig-raex-21:PMNAuthoritativeDNSIPList>
                                <tadig-raex-21:DNSitem>
                                    <tadig-raex-21:IPAddress>212.234.96.11</tadig-raex-21:IPAddress>
                                    <tadig-raex-21:DNSname>PMASDNS1.mnc001.mcc208.gprs</tadig-raex-21:DNSname>
                                </tadig-raex-21:DNSitem>
                                <tadig-raex-21:DNSitem>
                                    <tadig-raex-21:IPAddress>212.234.96.74</tadig-raex-21:IPAddress>
                                    <tadig-raex-21:DNSname>LYLADNS1.mnc001.mcc208.gprs</tadig-raex-21:DNSname>
                                </tadig-raex-21:DNSitem>
                                <tadig-raex-21:DNSitem>
                                    <tadig-raex-21:IPAddress>212.234.96.11</tadig-raex-21:IPAddress>
                                    <tadig-raex-21:DNSname>PMASDNS1.mnc001.mcc208.3gppnetwork.org</tadig-raex-21:DNSname>
                                </tadig-raex-21:DNSitem>
                                <tadig-raex-21:DNSitem>
                                    <tadig-raex-21:IPAddress>212.234.96.74</tadig-raex-21:IPAddress>
                                    <tadig-raex-21:DNSname>LYLADNS1.mnc001.mcc208.3gppnetwork.org</tadig-raex-21:DNSname>
                                </tadig-raex-21:DNSitem>
                            </tadig-raex-21:PMNAuthoritativeDNSIPList>
                        </tadig-raex-21:IPRoaming_IW_Info_General>
                    </tadig-raex-21:IPRoaming_IW_InfoSection>
                </tadig-raex-21:NetworkData>
                <tadig-raex-21:HostedNetworksInfo>
                    <tadig-raex-21:SectionNA>Section not applicable</tadig-raex-21:SectionNA>
                </tadig-raex-21:HostedNetworksInfo>
                <tadig-raex-21:PresentationOfCountryInitialsAndMNN>DEMO FR</tadig-raex-21:PresentationOfCountryInitialsAndMNN>
                <tadig-raex-21:AbbreviatedMNN>DEMO</tadig-raex-21:AbbreviatedMNN>
                <tadig-raex-21:NetworkColourCode>1</tadig-raex-21:NetworkColourCode>
            </tadig-raex-21:Network>
        </tadig-raex-21:NetworkList>
    </tadig-raex-21:OrganisationInfo>
</tadig-raex-21:TADIGRAEXIR21>

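I need to get the IP addresses from all DNSitem elements and save them to a list that will be exported to a CSV file. Each row should pair one IP address with the TADIG code.

I took inspiration from this link (Getting all instances of child node using xml.etree.ElementTree), and this is my code: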

import csv
import os
from xml.etree import ElementTree as ET

out = csv.writer(open("result.csv", "w"), delimiter=',', quoting=csv.QUOTE_ALL)
# loop through the directory and parse every XML file
directory = "C:\\Users\\Walid Ben Chamekh\\PycharmProjects\\dnsparser\\com\\ir21\\dnsparser\\"

# start parsing
print("Start parsing")
for filename in os.listdir(directory):
    if filename.endswith(".xml"):
        print(filename)
        root = ET.parse(filename).getroot()
        # get Network TADIG code
        raexFileHeader = root.getchildren()[0]
        tadig = raexFileHeader.getchildren()[2].text

        try:
            DNS = root.findall(
                ".//tadig-raex-21:OrganisationInfo/tadig-raex-21:NetworkList/tadig-raex-21:Network["
                "1]/tadig-raex-21:NetworkData/tadig-raex-21:IPRoaming_IW_InfoSection/tadig-raex-21"
                ":IPRoaming_IW_Info_General/tadig-raex-21:PMNAuthoritativeDNSIPList")
        except Exception:
            print("no data")
            continue

        # get all IPs from all dns items
        for item in DNS.getchildren():
            IPresult = [tadig]
            ip = item.getchildren()[0].text
            IPresult.append(ip)
            print(IPresult)
            out.writerow(IPresult)
        continue
    else:
        continue
# End Parsing
print("End Parsing")

It does not work: the DNS list is always empty! Thanks for your help.

1 Answer:

Answer 0 (score: 0)

The problem is that ElementTree is not very smart about namespaces. In calls to find(), findall() and iterfind(), you need to pass a dictionary of namespaces, as described in this answer: https://stackoverflow.com/a/14853417/2044940

namespaces = { "tadig-raex-21": "https://infocentre.gsm.org/TADIG-RAEX-IR21" }
root.findall("...", namespaces)

With this change and a few others, I was able to make it return the following data:

['DEMO', '212.234.96.11']
['DEMO', '212.234.96.74']
['DEMO', '212.234.96.11']
['DEMO', '212.234.96.74']

Here is the Python script. Note that you need to supply a filename that points to the input XML.

from xml.etree import ElementTree as ET

# Doesn't help, it is only used for serialization, i.e. writing XML, but not parsing
#ET.register_namespace("tadig-raex-21", "https://infocentre.gsm.org/TADIG-RAEX-IR21")

# Dictionary of namespaces, needed to avoid error:
# -> SyntaxError: prefix 'tadig-raex-21' not found in prefix map
namespaces = {
    "tadig-raex-21": "https://infocentre.gsm.org/TADIG-RAEX-IR21"
}

root = ET.parse(filename).getroot()

# Fetch SenderTADIG by path
# TODO: handle case if the element doesn't exist
tadig = root.find(
    "tadig-raex-21:RAEXIR21FileHeader/"
    "tadig-raex-21:SenderTADIG", namespaces).text

# Select DNSitems for further processing
DNS = root.findall(
    "tadig-raex-21:OrganisationInfo/"
    "tadig-raex-21:NetworkList/"
    "tadig-raex-21:Network[1]/"
    "tadig-raex-21:NetworkData/"
    "tadig-raex-21:IPRoaming_IW_InfoSection/"
    "tadig-raex-21:IPRoaming_IW_Info_General/"
    "tadig-raex-21:PMNAuthoritativeDNSIPList/"
    "tadig-raex-21:DNSitem", namespaces)

# DNS is a list of elements, can't call getchildren() on it directly!
for item in DNS:
    IPresult = [tadig]
    # It's safer to fetch the IPAddress via the element name
    ip = item.find("tadig-raex-21:IPAddress", namespaces).text
    IPresult.append(ip)
    print(IPresult)

It is also possible to do without the namespaces dictionary, but then the full namespace URI must be used as a prefix in curly braces (found here):

tadig = root.find(
  "{https://infocentre.gsm.org/TADIG-RAEX-IR21}RAEXIR21FileHeader/"
  "{https://infocentre.gsm.org/TADIG-RAEX-IR21}SenderTADIG").text

Interestingly, it does not seem possible to read the namespace declarations from the attributes of the root element (which might have allowed us to generate the namespaces dict from them):

# Empty dict
ET.parse(filename).getroot().attrib

The root element does contain the namespace information:

<tadig-raex-21:TADIGRAEXIR21 xmlns:tadig-raex-21="https://infocentre.gsm.org/TADIG-RAEX-IR21" xmlns:ns2="https://infocentre.gsm.org/TADIG-GEN">

You cannot pass a namespaces dict to getroot(), so I don't know whether or how the value of the xmlns:tadig-raex-21 attribute can be retrieved.
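One possible way to collect the xmlns declarations with the standard library alone is ET.iterparse() with the "start-ns" event, which reports each (prefix, uri) pair as the parser meets it. The sketch below is only an illustration under some assumptions: the helper names collect_namespaces and extract_rows and the shortened SAMPLE document are made up for this example, the prefix tadig-raex-21 is still hard-coded in the paths, and the ".//...DNSitem" shortcut matches DNSitem elements anywhere in the tree rather than only under Network[1] as in the script above.

```python
import csv
import io
from xml.etree import ElementTree as ET

# Shortened stand-in for the IR.21 file (assumption: same prefixes and URIs).
SAMPLE = b"""<tadig-raex-21:TADIGRAEXIR21
    xmlns:tadig-raex-21="https://infocentre.gsm.org/TADIG-RAEX-IR21"
    xmlns:ns2="https://infocentre.gsm.org/TADIG-GEN">
  <tadig-raex-21:RAEXIR21FileHeader>
    <tadig-raex-21:SenderTADIG>DEMO</tadig-raex-21:SenderTADIG>
  </tadig-raex-21:RAEXIR21FileHeader>
  <tadig-raex-21:DNSitem>
    <tadig-raex-21:IPAddress>212.234.96.11</tadig-raex-21:IPAddress>
  </tadig-raex-21:DNSitem>
  <tadig-raex-21:DNSitem>
    <tadig-raex-21:IPAddress>212.234.96.74</tadig-raex-21:IPAddress>
  </tadig-raex-21:DNSitem>
</tadig-raex-21:TADIGRAEXIR21>
"""


def collect_namespaces(source):
    """Hypothetical helper: return {prefix: uri} for every xmlns declaration.

    `source` is a file name or a binary file object. If a prefix is declared
    more than once, the last declaration wins.
    """
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        namespaces[prefix] = uri
    return namespaces


def extract_rows(root, namespaces, prefix="tadig-raex-21"):
    """Hypothetical helper: return one [tadig, ip] row per DNSitem element."""
    tadig = root.findtext(
        f"{prefix}:RAEXIR21FileHeader/{prefix}:SenderTADIG", "", namespaces)
    rows = []
    # ".//" searches the whole tree; narrow the path if only one list is wanted.
    for item in root.iterfind(f".//{prefix}:DNSitem", namespaces):
        ip = item.findtext(f"{prefix}:IPAddress", "", namespaces)
        rows.append([tadig, ip])
    return rows


namespaces = collect_namespaces(io.BytesIO(SAMPLE))
root = ET.fromstring(SAMPLE)
rows = extract_rows(root, namespaces)

# Write the rows as quoted CSV; an in-memory buffer stands in for result.csv.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerows(rows)
print(buffer.getvalue(), end="")
```

For a real file, collect_namespaces(filename) and ET.parse(filename).getroot() would take the place of the in-memory sample, at the cost of reading the file twice. When writing to an actual CSV file on Python 3, opening it with newline="" avoids blank lines between rows on Windows.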