Question

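The XML I need to parse is very large (> 100 GB). On every iteration I clear the processed element, and I also clear the root element. Even so, some memory still leaks, and with a file this large the leak grows quite big (more than 200-300 MB).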

The XML structure is as follows:

<source>
<items>
    <item>
        <id></id>
        <title></title>
        <category></category>
        <city></city>
    </item>
    .
    .
    .
    .
    <item>
        <id></id>
        <title></title>
        <category></category>
        <city></city>
    </item>
</items>
</source>

My code:

from urllib.request import urlopen
import gzip
from lxml import etree
import csv

response = urlopen(url)

response_headers = response.info()
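# A missing Content-Type header returns None, so fall back to '' before the substring test.
if response_headers.get('Content-Encoding') == 'gzip' or 'gzip' in (response_headers.get('Content-Type') or ''):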
    response = gzip.GzipFile(fileobj=response)

context = etree.iterparse(response, events=("start", "end"))
context = iter(context)

_, root = next(context)
_, items_root = next(context)

try:
    for event, item in context:
        if event == "end" and item.tag == "item":
            title = item.find('title').text
            category = item.find('category').text
            city = item.find('city').text
            item.clear()
        items_root.clear()
        root.clear()
except Exception:
    raise  # re-raise unchanged, keeping the original traceback

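Previously I only cleared item and root, but then I thought that items might be accumulating a lot of empty item elements, which could be what leaks memory. Even after adding items_root.clear(), though, the problem is not solved.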
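For reference, here is a minimal sketch of an idiom that is often used with lxml for this kind of streaming parse: listen only for "end" events on item, clear each element after reading it, and then delete the already-processed siblings that the parent still references. The function name iter_items and the sample data are hypothetical, and it is not confirmed that this removes the residual growth described above; it only illustrates the clean-up pattern.

```python
import io
from lxml import etree


def iter_items(source, tag="item"):
    """Yield one dict per <item>, releasing each element after it is read.

    `iter_items` is a hypothetical helper name, not part of lxml.
    """
    # Ask lxml for "end" events of the requested tag only, so the loop
    # never sees partially built elements.
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield {
            "title": elem.findtext("title"),
            "category": elem.findtext("category"),
            "city": elem.findtext("city"),
        }
        # Drop the element's own children, text, and attributes.
        elem.clear()
        # Remove earlier siblings that the parent still holds on to.
        parent = elem.getparent()
        if parent is not None:
            while elem.getprevious() is not None:
                del parent[0]


# Made-up sample data in the shape described in the question.
sample = io.BytesIO(
    b"<source><items>"
    b"<item><id>1</id><title>A</title><category>c1</category><city>X</city></item>"
    b"<item><id>2</id><title>B</title><category>c2</category><city>Y</city></item>"
    b"</items></source>"
)
rows = list(iter_items(sample))
```

In the code from the question, the (possibly gzip-wrapped) response object would be passed in place of sample, for example: for row in iter_items(response): ...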

Preventing memory leaks when parsing huge XML with etree

0 answers: