Split an XML file into multiple files based on tags using Python

Time: 2017-11-14 08:24:48

Tags: python xml

I have a large XML file that contains image annotation details. A sample looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="ScoreBoard-Vivon" color="#bf5786"/>
        <tag name="Perimeter-Vivon" color="#032585"/>
    </tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box top="253" left="166" width="56" height="24">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="255" left="229" width="61" height="21">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="290" width="58" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="361" width="56" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="417" width="63" height="22">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="486" width="63" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="504" left="329" width="51" height="29">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
    </images>
</dataset>

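I want to split this file by tag name. This file has two tags, ScoreBoard-Vivon and Perimeter-Vivon, and I want to create a separate XML file for each of them. The required output is as follows: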

ScoreBoard-Vivon.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="ScoreBoard-Vivon" color="#bf5786"/>
    </tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box top="504" left="329" width="51" height="29">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
    </images>
</dataset>

Perimeter-Vivon.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="Perimeter-Vivon" color="#032585"/>
    </tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box top="253" left="166" width="56" height="24">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="255" left="229" width="61" height="21">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="290" width="58" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="361" width="56" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="417" width="63" height="22">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="486" width="63" height="20">
                <label>Perimeter-Vivon</label>
            </box>
        </image>
    </images>
</dataset>

I have 350-400 such tags. How can I split them into individual files?

New example:

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="Perimeter-SVT" color="#f9e99c"/>
        <tag name="Perimeter-Vivon" color="#032585"/>
        <tag name="ScoreBoard-Vivon" color="#bf5786"/>
        <tag name="Perimeter-StarSports" color="#12dadd"/>
    </tags>
    <images>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0011.jpg">
            <box top="505" left="327" width="56" height="29">
                <label>ScoreBoard-Vivon</label>
            </box>
            <box top="218" left="387" width="67" height="24">
                <label>Perimeter-SVT</label>
            </box>
        </image>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0005.jpg">
            <box top="254" left="159" width="64" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="255" left="225" width="61" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="285" width="63" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="253" left="357" width="58" height="24">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="424" width="56" height="25">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="256" left="484" width="65" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="507" left="326" width="58" height="26">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0009.jpg">
            <box top="249" left="400" width="59" height="29">
                <label>Perimeter-StarSports</label>
            </box>
        </image>
    </images>
</dataset>

2 answers:

Answer 0 (score: 1)

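One approach is to take the original XML, determine which <tags> are in use, then copy the XML once per tag and remove everything that does not match: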

import xml.etree.ElementTree as ET
import copy

img_xml = """<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="ScoreBoard-Vivon" color="#bf5786"/>
        <tag name="Perimeter-Vivon" color="#032585"/>
    </tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box top="253" left="166" width="56" height="24">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="255" left="229" width="61" height="21">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="290" width="58" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="361" width="56" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="417" width="63" height="22">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="486" width="63" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="504" left="329" width="51" height="29">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
    </images>
</dataset>
"""

root = ET.fromstring(img_xml)
tag_names = [tag.attrib['name'] for tag in root.find('tags')]

for tag_name in tag_names:
    root_copy = copy.deepcopy(root)

    # Blank out every <tag> that does not match (clear() empties the element but leaves it in place)
    for tag in root_copy.find('tags'):
        if tag.attrib['name'] != tag_name:
            tag.clear()

    # Blank out every <box> whose <label> (its first child) does not match
    for box in root_copy.findall("./images/image/box"):
        if box[0].text != tag_name:
            box.clear()

    ET.ElementTree(root_copy).write('{}.xml'.format(tag_name))

This gives you two output XML files (the cleared elements remain as empty <tag /> and <box /> placeholders):

Perimeter-Vivon.xml

<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag /><tag color="#032585" name="Perimeter-Vivon" />
    </tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box height="24" left="166" top="253" width="56">
                <label>Perimeter-Vivon</label>
            </box>
            <box height="21" left="229" top="255" width="61">
                <label>Perimeter-Vivon</label>
            </box>
            <box height="23" left="290" top="254" width="58">
                <label>Perimeter-Vivon</label>
            </box>
            <box height="20" left="361" top="254" width="56">
                <label>Perimeter-Vivon</label>
            </box>
            <box height="22" left="417" top="254" width="63">
                <label>Perimeter-Vivon</label>
            </box>
            <box height="20" left="486" top="254" width="63">
                <label>Perimeter-Vivon</label>
            </box>
            <box /></image>
    </images>
</dataset>

ScoreBoard-Vivon.xml

<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag color="#bf5786" name="ScoreBoard-Vivon" />
        <tag /></tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box /><box /><box /><box /><box /><box /><box height="29" left="329" top="504" width="51">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
    </images>
</dataset>
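
The script above calls clear(), which is why empty <tag /> and <box /> elements are left behind in both files. If they are unwanted, a variant can detach the non-matching elements from their parents instead. The following is only a sketch under assumptions: the helper name split_by_tag and the tiny sample (tags A and B, files one.jpg and two.jpg) are made up for illustration, and every <box> is assumed to carry a single <label> child. Images that end up with no boxes are dropped as well, which matches the output requested in the question.

```python
import copy
import xml.etree.ElementTree as ET


def split_by_tag(root):
    """Return {tag name: pruned deep copy of root} (hypothetical helper)."""
    result = {}
    for tag_name in [t.get("name") for t in root.find("tags")]:
        root_copy = copy.deepcopy(root)

        # Detach <tag> entries for other labels instead of clearing them.
        tags = root_copy.find("tags")
        for tag in list(tags):  # iterate over a snapshot while removing
            if tag.get("name") != tag_name:
                tags.remove(tag)

        # Detach non-matching <box> elements, then drop images left without boxes.
        images = root_copy.find("images")
        for image in list(images):
            for box in image.findall("box"):
                if box.findtext("label") != tag_name:
                    image.remove(box)
            if image.find("box") is None:
                images.remove(image)

        result[tag_name] = root_copy
    return result


# Made-up sample data, much smaller than the real annotation file.
sample = """<dataset>
  <tags><tag name="A" color="#111111"/><tag name="B" color="#222222"/></tags>
  <images>
    <image file="one.jpg">
      <box top="1" left="1" width="1" height="1"><label>A</label></box>
      <box top="2" left="2" width="2" height="2"><label>B</label></box>
    </image>
    <image file="two.jpg">
      <box top="3" left="3" width="3" height="3"><label>B</label></box>
    </image>
  </images>
</dataset>"""

parts = split_by_tag(ET.fromstring(sample))
for name, part_root in parts.items():
    # Number of boxes kept for each tag.
    print(name, len(part_root.findall("./images/image/box")))
```

Each value in parts can be written out the same way as in the answer, with ET.ElementTree(part_root).write(...). Removing elements leaves the surrounding whitespace uneven; on Python 3.9 or later, ET.indent(part_root) can tidy the indentation before writing.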

Answer 1 (score: 1)

The following (XSLT 2.0) stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xsl:template match="//dataset/tags">
      <xsl:for-each select="./tag">
            <xsl:variable name="tagName" select="@name" />

                <xsl:result-document method="xml" href="{$tagName}.xml">
                    <dataset>    
                        <xsl:copy-of select="/dataset/name"/>
                        <xsl:copy-of select="/dataset/comment"/>
                        <tags>
                            <xsl:copy-of select="/dataset/tags/tag[./@name = $tagName]"/>
                        </tags>
                        <images>
                        <xsl:for-each select="/dataset/images/image[./box/label/text() = $tagName]">
                            <image> 
                                <xsl:copy-of select="./@file"/>
                                <xsl:copy-of select="./box[./label[./text() = $tagName]]"/>
                            </image>
                        </xsl:for-each>
                        </images>
                    </dataset>
                </xsl:result-document>                              

      </xsl:for-each>
    </xsl:template>     
</xsl:stylesheet>

when applied to your input, produces the following results:

Perimeter-SVT.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="Perimeter-SVT" color="#f9e99c"/>
    </tags>
    <images>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0011.jpg">
            <box top="218" left="387" width="67" height="24">
                <label>Perimeter-SVT</label>
            </box>
        </image>
    </images>
</dataset>

Perimeter-Vivon.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="Perimeter-Vivon" color="#032585"/>
    </tags>
    <images>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0005.jpg">
            <box top="254" left="159" width="64" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="255" left="225" width="61" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="285" width="63" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="253" left="357" width="58" height="24">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="424" width="56" height="25">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="256" left="484" width="65" height="23">
                <label>Perimeter-Vivon</label>
            </box>
        </image>
    </images>
</dataset>

ScoreBoard-Vivon.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="ScoreBoard-Vivon" color="#bf5786"/>
    </tags>
    <images>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0011.jpg">
            <box top="505" left="327" width="56" height="29">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0005.jpg">
            <box top="507" left="326" width="58" height="26">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
    </images>
</dataset>

Perimeter-StarSports.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="Perimeter-StarSports" color="#12dadd"/>
    </tags>
    <images>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0009.jpg">
            <box top="249" left="400" width="59" height="29">
                <label>Perimeter-StarSports</label>
            </box>
        </image>
    </images>
</dataset>
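
xsl:result-document is an XSLT 2.0 instruction, so the stylesheet needs an XSLT 2.0 processor; Python's standard library does not include one. For readers who want to stay in plain Python, the same logic can be approximated with xml.etree.ElementTree. The sketch below is not the answer's method: write_split_files and the sample data (tags A and B, one.jpg, two.jpg) are hypothetical, and it assumes each tag name is safe to use as a file name. It groups the boxes by label in a single pass over the images, which may help when there are 350-400 tags, and then builds one new <dataset> per tag.

```python
import copy
import os
import tempfile
import xml.etree.ElementTree as ET
from collections import defaultdict


def write_split_files(source_path, out_dir):
    """Write one <tag name>.xml per <tag>; return the paths (hypothetical helper)."""
    root = ET.parse(source_path).getroot()

    # Single pass: label -> list of (image attributes, boxes with that label).
    grouped = defaultdict(list)
    for image in root.findall("./images/image"):
        per_label = defaultdict(list)
        for box in image.findall("box"):
            per_label[box.findtext("label")].append(box)
        for label, boxes in per_label.items():
            grouped[label].append((dict(image.attrib), boxes))

    written = []
    for tag in root.findall("./tags/tag"):
        tag_name = tag.get("name")
        dataset = ET.Element("dataset")
        # Copy <name> and <comment> through unchanged when present.
        for child_name in ("name", "comment"):
            child = root.find(child_name)
            if child is not None:
                dataset.append(copy.deepcopy(child))
        ET.SubElement(dataset, "tags").append(copy.deepcopy(tag))
        images = ET.SubElement(dataset, "images")
        # Only images that contain at least one matching box are emitted.
        for attrib, boxes in grouped.get(tag_name, []):
            image = ET.SubElement(images, "image", attrib)
            image.extend([copy.deepcopy(box) for box in boxes])
        # Assumption: tag_name is a valid file name on the target system.
        path = os.path.join(out_dir, tag_name + ".xml")
        ET.ElementTree(dataset).write(path, encoding="UTF-8", xml_declaration=True)
        written.append(path)
    return written


# Made-up sample data standing in for the real annotation file.
sample = """<dataset>
  <name>demo</name>
  <tags><tag name="A" color="#111111"/><tag name="B" color="#222222"/></tags>
  <images>
    <image file="one.jpg">
      <box top="1" left="1" width="1" height="1"><label>A</label></box>
      <box top="2" left="2" width="2" height="2"><label>B</label></box>
    </image>
    <image file="two.jpg">
      <box top="3" left="3" width="3" height="3"><label>B</label></box>
    </image>
  </images>
</dataset>"""

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.xml")
    with open(src, "w", encoding="utf-8") as fh:
        fh.write(sample)
    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    paths = write_split_files(src, out_dir)
    names = sorted(os.path.basename(p) for p in paths)
    # Count the boxes that landed in each output file.
    box_counts = {
        n: len(ET.parse(os.path.join(out_dir, n)).findall("./images/image/box"))
        for n in names
    }

print(names)
print(box_counts)
```

The files are written with an XML declaration but are not pretty-printed; the copied elements simply keep whatever whitespace they had in the input.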