Question

I keep running into heap memory problems when processing large files. I am processing a 9 GB XML file.

My XML looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DatosAbonados xmlns="http://www.cnmc.es/DatosAbonados">
    <DatosAbonado Operacion="1" FechaExtraccion="2015-10-08">
        <Titular>
            <PersonaJuridica DocIdentificacionJuridica="A84619488" RazonSocial="HERMANOS ROJAS" NombreComercial="PINTURAS ROJAS"/>
        </Titular>
        <Domicilio Escalera=" " Piso=" " Puerta=" " TipoVia="AVENIDA" NombreVia="MANOTERAS" NumeroCalle="10" Portal=" " CodigoPostal="28050" Poblacion="Madrid" Provincia="28"/>
        <NumeracionAbonado>
            <Rangos NumeroDesde="211188600" NumeroHasta="211188699" ConsentimientoGuias-Consulta="1" VentaDirecta-Publicidad="1" ModoPago="1">
                <Operador RazonSocial="11888 SERVICIO CONSULTA TELEFONICA S.A." DocIdentificacionJuridica="A83519389"/>
            </Rangos>
        </NumeracionAbonado>
    </DatosAbonado>
    <DatosAbonado Operacion="1" FechaExtraccion="2015-10-08">
        <Titular>
            <PersonaJuridica DocIdentificacionJuridica="A84619489" RazonSocial="HERMANOS RUBIO" NombreComercial="RUBIO PELUQUERIAS"/>
        </Titular>
        <Domicilio Escalera=" " Piso=" " Puerta=" " TipoVia="AVENIDA" NombreVia="BURGOS" NumeroCalle="18" Portal=" " CodigoPostal="28036" Poblacion="Madrid" Provincia="28"/>
        <NumeracionAbonado>
            <Rangos NumeroDesde="211186000" NumeroHasta="211186099" ConsentimientoGuias-Consulta="1" VentaDirecta-Publicidad="1" ModoPago="1">
                <Operador RazonSocial="11888 SERVICIO CONSULTA TELEFONICA S.A." DocIdentificacionJuridica="A83519389"/>
            </Rangos>
        </NumeracionAbonado>
    </DatosAbonado>
</DatosAbonados>

After a while, the iteration runs out of heap memory. Please help me write optimized code.

Note: the server has 3 GB of heap space and I cannot increase it. I am running with the following parameters: -Xms1024m -Xmx3g

My Cmt class is:

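import java.util.ArrayList;
import java.util.List;

public class Cmt {
    private List<DetailInfo> details;

    public List<DetailInfo> getDetails() {
        return details;
    }

    // Despite its name, this appends a single detail; the list is created lazily.
    public void setDetails(DetailInfo detail) {
        if (details == null) {
            details = new ArrayList<DetailInfo>();
        }
        this.details.add(detail);
    }
}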

My logic is:

if (startElement.getName().getLocalPart().equals("DatosAbonado")) {
    detailInfo = new DetailInfo();

    Iterator<Attribute> attributes = startElement.getAttributes();
    while (attributes.hasNext()) {
        Attribute attribute = attributes.next();
        if (attribute.getName().toString().equals("Operacion")) {
            detailInfo.setOperacion(attribute.getValue());
        }
    }
}
if (event.isEndElement()) {
    EndElement endElement = event.asEndElement();
    if (endElement.getName().getLocalPart().equals("DatosAbonado")) {
        Cmt cmt = null;
        if (mapCmt.keySet().contains(identificador)) {
            cmt = mapCmt.get(identificador);
        } else {
            cmt = new Cmt();
        }
        cmt.setDetails(detailInfo);
        mapCmt.put(identificador, cmt);
    }
}

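In practice there are very few Cmt objects, but one DetailInfo object is created for every DatosAbonado element, so a huge number of DetailInfo objects end up in memory.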


Answer 1

The root of the problem is most likely this:

mapCmt.put(someKey, cmt);

You are filling a hash map with many large Cmt objects. You need to do one of the following:

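Process the data immediately instead of keeping it in a data structure.
Write the data to a database and query it later.
Increase the heap size.
Find a less memory-hungry representation for your data.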

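The last two approaches do not scale, though. As the input file grows, you will need more and more memory, until you eventually exceed the memory capacity of your execution platform.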
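
As a minimal sketch of the first option (handle each record as soon as it is complete), the loop below reuses the StAX event API from the question but writes one line per DatosAbonado to a Writer instead of adding it to mapCmt. The class name StreamingAbonados, the process method, the output format, and the use of DocIdentificacionJuridica as the key are assumptions for illustration; the question does not show where identificador comes from.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not taken from the question.
class StreamingAbonados {

    /** Writes one "key;Operacion" line per DatosAbonado and returns the record count. */
    static long process(Reader xml, Writer out) throws XMLStreamException, IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // This input needs neither DTDs nor external entities, so switch them off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLEventReader reader = factory.createXMLEventReader(xml);

        long count = 0;
        String operacion = null;
        String docId = null; // assumed to play the role of "identificador"
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    String name = start.getName().getLocalPart();
                    if (name.equals("DatosAbonado")) {
                        operacion = attr(start, "Operacion");
                        docId = null; // reset per-record state
                    } else if (name.equals("PersonaJuridica")) {
                        docId = attr(start, "DocIdentificacionJuridica");
                    }
                } else if (event.isEndElement()
                        && event.asEndElement().getName().getLocalPart().equals("DatosAbonado")) {
                    // The record is complete: emit it now and keep nothing around.
                    out.write(docId + ";" + operacion + "\n");
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    // Unprefixed attributes are in no namespace, so a local-name-only QName matches them.
    private static String attr(StartElement element, String localName) {
        Attribute a = element.getAttributeByName(new QName(localName));
        return a == null ? null : a.getValue();
    }
}
```

Only the fields of the record currently being read are held in local variables, so heap use should track the size of one DatosAbonado rather than the size of the file. If the per-key grouping is still needed afterwards, the output lines can be sorted or loaded into a database.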

Answer 2

Accumulating the DatosAbonado elements is indeed the killer. Given enough of them, this will make your application choke.

This approach simply does not scale. As Stephan C pointed out, you need to process each DatosAbonado as it arrives instead of collecting them in a container.
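
One way to picture "process each element as it arrives" is to hand every record to a callback and then forget it. The sketch below does that with the StAX cursor API (XMLStreamReader), which avoids allocating an event object per node. AbonadoScanner, Detail, and scan are hypothetical names, and Detail is only a stand-in for the question's DetailInfo, whose fields were not shown.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical names; only the element and attribute names come from the question's XML.
class AbonadoScanner {

    /** Minimal stand-in for the question's DetailInfo. */
    static final class Detail {
        final String operacion;
        final String fechaExtraccion;

        Detail(String operacion, String fechaExtraccion) {
            this.operacion = operacion;
            this.fechaExtraccion = fechaExtraccion;
        }
    }

    /** Calls handler once per DatosAbonado start tag; no record is retained here. */
    static void scan(Reader xml, Consumer<Detail> handler) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "DatosAbonado".equals(reader.getLocalName())) {
                    // A null namespace URI means "match the attribute by local name only".
                    handler.accept(new Detail(
                            reader.getAttributeValue(null, "Operacion"),
                            reader.getAttributeValue(null, "FechaExtraccion")));
                }
            }
        } finally {
            reader.close();
        }
    }
}
```

The memory behaviour then depends entirely on the handler: one that writes to a file or a database keeps the heap flat, while one that adds every Detail to a list reintroduces the original problem.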

Since this is a typical scenario for which I developed the LDX+ code generator, I took the following steps:

Create an XML Schema file from the XML (since you did not provide one) using: https://devutilsonline.com/xsd-xml/generate-xsd-from-xml
Generate the code with LDX+

This code generator actually uses SAX, and the generated code lets you:

Serialize complexElements to Java objects
Configure at runtime how 1-to-many relationships (such as the one here) are handled

I uploaded the code here: https://bitbucket.org/lolkedijkstra/ldx-samples To view the code, navigate to the Source folder. There you will find DatosAbonnados.

This approach does scale well (memory consumption stays flat).
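
The generated LDX+ code is not reproduced here. Purely to illustrate the SAX style it is said to build on, a hand-written handler using the JDK's own SAX parser might look like the sketch below. It treats the 1-to-many relationship (one DatosAbonado, many Rangos) by counting the Rangos and reporting the total when the parent element closes. AbonadoSaxHandler, parse, and the callback shape are assumptions and are not LDX+ output.

```java
import java.io.IOException;
import java.util.function.BiConsumer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical hand-written handler; not produced by LDX+.
class AbonadoSaxHandler extends DefaultHandler {
    // Receives (Operacion, number of Rangos) once per DatosAbonado.
    private final BiConsumer<String, Integer> onAbonado;
    private String operacion;
    private int rangos;

    AbonadoSaxHandler(BiConsumer<String, Integer> onAbonado) {
        this.onAbonado = onAbonado;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("DatosAbonado".equals(localName)) {
            // Unprefixed attributes have the empty namespace URI.
            operacion = atts.getValue("", "Operacion");
            rangos = 0; // reset per-record state
        } else if ("Rangos".equals(localName)) {
            rangos++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("DatosAbonado".equals(localName)) {
            // Record complete: report it and keep no reference to it.
            onAbonado.accept(operacion, rangos);
        }
    }

    static void parse(InputSource source, BiConsumer<String, Integer> onAbonado)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName is populated
        factory.newSAXParser().parse(source, new AbonadoSaxHandler(onAbonado));
    }
}
```

The handler holds two fields regardless of how many records the file contains, which is the property that keeps memory use flat.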

How to process a large XML file (9 GB) using the StAX API
