How do I wrap elements in a specified parent tag in XML with Python?

Time: 2020-04-16 08:34:26

Tags: python xml tags lxml elementtree

I have this XML:

<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="191.745,592.218,249.042,603.578">
<textline bbox="191.745,592.218,249.042,603.578">
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" colourspace="DeviceGray" ncolour="0" size="11.360">A</text>
<text font="NUMPTY+ImprintMTnum" bbox="199.227,592.218,205.657,603.578" colourspace="DeviceGray" ncolour="0" size="11.360">P</text>
<text font="NUMPTY+ImprintMTnum" bbox="205.545,592.218,211.975,603.578" colourspace="DeviceGray" ncolour="0" size="11.360">P</text>
<text font="NUMPTY+ImprintMTnum" bbox="211.023,592.218,218.617,603.578" colourspace="DeviceGray" ncolour="0" size="11.360">A</text>
<text font="NUMPTY+ImprintMTnum" bbox="218.515,592.218,226.109,603.578" colourspace="DeviceGray" ncolour="0" size="11.360">R</text>
<text font="NUMPTY+ImprintMTnum" bbox="226.008,592.218,233.602,603.578" colourspace="DeviceGray" ncolour="0" size="11.360">A</text>
<text font="NUMPTY+ImprintMTnum" bbox="232.812,592.218,240.932,603.578" colourspace="DeviceGray" ncolour="0" size="11.360">T</text>
<text font="NUMPTY+ImprintMTnum" bbox="240.922,592.218,249.042,603.578" colourspace="DeviceGray" ncolour="0" size="11.360">O</text>
</textline>
</textbox>
<textbox id="1" bbox="44.614,554.008,58.101,564.246">
<textline bbox="44.614,554.008,58.101,564.246">
<text font="NUMPTY+ImprintMTnum" bbox="44.614,554.008,49.369,564.246" colourspace="DeviceGray" ncolour="0" size="10.238">2</text>
<text font="NUMPTY+ImprintMTnum" bbox="49.268,554.008,54.022,564.246" colourspace="DeviceGray" ncolour="0" size="10.238">4</text>
<text font="NUMPTY+ImprintMTnum" bbox="53.922,554.008,58.101,564.246" colourspace="DeviceGray" ncolour="0" size="10.238">a</text>
</textline>
</textbox>
<textbox id="2" bbox="43.563,475.008,58.117,485.246">
<textline bbox="43.563,475.008,58.117,485.246">
<text font="NUMPTY+ImprintMTnum" bbox="43.563,475.008,48.317,485.246" colourspace="DeviceGray" ncolour="0" size="10.238">2</text>
<text font="NUMPTY+ImprintMTnum" bbox="48.226,475.008,52.980,485.246" colourspace="DeviceGray" ncolour="0" size="10.238">4</text>
<text font="NUMPTY+ImprintMTnum" bbox="52.889,475.008,58.117,485.246" colourspace="DeviceGray" ncolour="0" size="10.238">b</text>
</textline>
</textbox>
</page>
</pages>

The real file is longer, but the structure is the same.

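I want to insert a <newline> tag whenever the distance between the first number of one bbox attribute and the first number of the next bbox attribute exceeds a specified amount. I want a tag to be closed only when another one needs to be opened. Like this: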

<newline>
   <text tags>[...]</text tags>
</newline>
<newline>
   <text tags>[...]</text tags>
</newline>
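To make the distance measure described above concrete, here is a minimal sketch that parses bbox strings and computes the gap between the first number of each bbox and the first number of the next one. The helper names `first_coord` and `gaps` are hypothetical, and the sample values are copied from the first three <text> elements of the XML above.

```python
def first_coord(bbox):
    """Return the first number of an 'x0,y0,x1,y1' bbox string as a float."""
    return float(bbox.split(",")[0])


def gaps(bboxes):
    """Signed differences between the first number of each bbox and the next."""
    xs = [first_coord(b) for b in bboxes]
    return [nxt - cur for cur, nxt in zip(xs, xs[1:])]


# bbox values taken from the first three <text> elements in the sample XML
sample = [
    "191.745,592.218,199.339,603.578",
    "199.227,592.218,205.657,603.578",
    "205.545,592.218,211.975,603.578",
]
print(gaps(sample))  # two gaps, roughly 7.5 and 6.3 units
```

Within one <textline> of the sample the gaps stay well below 20, so a threshold of that size would only trigger on larger jumps.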

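In other words, each <newline> element should wrap a run of <text> tags. I tried this with my code, but the output is not what I want. My code is below: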

import lxml.etree as etree
from lxml.builder import E

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('fe3.xml', parser)
root = tree.getroot()

def removeByIdx(parent, idx):
    currElem = parent[idx]   # The indicated element
    parent.remove(currElem)  # Remove it from the parent
    return currElem          # Return the removed element

def wrap(line, idxList):
    if len(idxList) == 0:
        return    # No elements to wrap
    # Take the first element from the original location
    idx = idxList.pop(0)     # Index of the first element
    elem = removeByIdx(line, idx) # The indicated element
    # Create "newline" element with "elem" inside
    nElem = E.newline(elem)
    line.insert(idx, nElem)  # Put it in place of "elem"
    while len(idxList) > 0:  # Process the rest of index list
        # Value not used, but must be removed
        idxList.pop(0)
        # Remove the current element from the original location
        currElem = removeByIdx(line, idx + 1)
        nElem.append(currElem)  # Append it to "newline"


distance = 0
x_prev = None
for x in tree.xpath('//text'):
    idxList = []
    bb = x.attrib.get('bbox')
    if bb is not None:
        bb = bb.split(',')
        #print('This: ', bb)

        if x_prev is not None:
            #print('  Previous: ', x_prev)
            distance = float(bb[0]) - float(x_prev[0])

        else:
            print('  No previous bbox')
        # Store this bounding box for the next loop (to be used as x_prev)
        x_prev = bb

        if distance > 20:
            for elem in x:
                par = elem.getparent()
                idx = par.index(elem)
                idxList.append(idx)
        else:  # "Wrong" element, wrap elements "gathered" so far
            wrap(x, idxList)
            idxList = []
        # Process "good" elements without any "bad" after them, if any
        wrap(x, idxList)
print(etree.tostring(root, encoding='unicode', pretty_print=True))
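One possible way to get the wrapping described above is sketched below. It is not a fix of the code in the question but a different approach under stated assumptions: grouping is done separately inside each <textline>, the gap is the absolute difference between the first bbox numbers of consecutive <text> elements, and the threshold of 20 is taken from the `distance > 20` check. The names `left_x`, `wrap_runs`, and `THRESHOLD` are hypothetical. The sketch uses the standard library's `xml.etree.ElementTree` so that it runs without extra packages; the calls it relies on (`list(elem)`, `remove`, `append`, `SubElement`, `get`) are also available in `lxml.etree`.

```python
import xml.etree.ElementTree as ET

THRESHOLD = 20.0  # assumption: taken from the `distance > 20` check in the question


def left_x(elem):
    """First number of the element's bbox attribute, or None if it has no bbox."""
    bbox = elem.get("bbox")
    return float(bbox.split(",")[0]) if bbox else None


def wrap_runs(textline, threshold=THRESHOLD):
    """Regroup the children of one <textline> into <newline> wrappers.

    A new <newline> is opened for the first child and again whenever the
    first bbox number jumps by more than `threshold` from the previous one.
    """
    children = list(textline)  # snapshot, because the parent is mutated below
    for child in children:
        textline.remove(child)

    current = None  # the <newline> element currently being filled
    prev_x = None   # first bbox number of the last child that had a bbox
    for child in children:
        x = left_x(child)
        jumped = (prev_x is not None and x is not None
                  and abs(x - prev_x) > threshold)
        if current is None or jumped:
            current = ET.SubElement(textline, "newline")
        current.append(child)
        if x is not None:
            prev_x = x


# Small made-up input: the third element sits far to the right of the second.
sample_xml = """<textline>
<text bbox="10.0,0,15.0,10">A</text>
<text bbox="16.0,0,21.0,10">B</text>
<text bbox="60.0,0,65.0,10">C</text>
</textline>"""

line = ET.fromstring(sample_xml)
wrap_runs(line)
print(ET.tostring(line, encoding="unicode"))
```

For a whole document, the same function could be applied to every line with something like `for line in list(root.iter("textline")): wrap_runs(line)`; taking the `list(...)` first avoids mutating the tree while it is being iterated. Whether the gap should be signed or absolute, and whether runs should span several <textline> elements, depends on the intended output and is left open here.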

0 answers:

No answers