Problem deserializing XML to objects - unwanted splitting on special characters

Asked: 2011-09-26 10:01:58

Tags: python xml encoding deserialization

I am trying to deserialize XML into objects, and I have run into an encoding problem with various items in the XML tree.

XML example:

<?xml version="1.0" encoding="utf-8"?>
<results>
  <FlightTravel>
    <QuantityOfPassengers>6</QuantityOfPassengers>
    <Id>N5GWXM</Id>
    <InsuranceId>330992</InsuranceId>
    <TotalTime>3h 00m</TotalTime>
    <TransactionPrice>540.00</TransactionPrice>
    <AdditionalPrice>0</AdditionalPrice>
    <InsurancePrice>226.56</InsurancePrice>
    <TotalPrice>9561.31</TotalPrice>
    <CompanyName>XXXXX</CompanyName>
    <TaxID>111-11-11-111</TaxID>
    <InvoiceStreet>Jagiellońska</InvoiceStreet>
    <InvoiceHouseNo>8</InvoiceHouseNo>
    <InvoiceZipCode>Jagiellońska</InvoiceZipCode>
    <InvoiceCityName>Warszawa</InvoiceCityName>
    <PayerStreet>Jagiellońska</PayerStreet>
    <PayerHouseNo>8</PayerHouseNo>
    <PayerZipCode>11-111</PayerZipCode>
    <PayerCityName>Warszawa</PayerCityName>
    <PayerEmail>no-reply@xxxx.pl</PayerEmail>
    <PayerPhone>123123123</PayerPhone>
    <Segments>
      <Segment0>
        <DepartureAirport>WAW</DepartureAirport>
        <DepartureDate>śr. 06 lip</DepartureDate>
        <DepartureTime>07:50</DepartureTime>
        <ArrivalAirport>VIE</ArrivalAirport>
        <ArrivalDate>śr. 06 lip</ArrivalDate>
        <ArrivalTime>09:15</ArrivalTime>
      </Segment0>
      <Segment1>
        <DepartureAirport>VIE</DepartureAirport>
        <DepartureDate>śr. 06 lip</DepartureDate>
        <DepartureTime>10:00</DepartureTime>
        <ArrivalAirport>SZG</ArrivalAirport>
        <ArrivalDate>śr. 06 lip</ArrivalDate>
        <ArrivalTime>10:50</ArrivalTime>
      </Segment1>
    </Segments>
  </FlightTravel>
</results>

XML deserialization code in Python:

# -*- coding: utf-8 -*-

from lxml import etree
import codecs

class TitleTarget(object):
    def __init__(self):
        self.text = []
    def start(self, tag, attrib):
        self.is_title = True #if tag == 'Title' else False
    def end(self, tag):
        pass
    def data(self, data):
        if self.is_title:
            self.text.append(data)
    def close(self):
        return self.text

parser = etree.XMLParser(target = TitleTarget())

infile = 'Flights.xml'
results = etree.parse(infile, parser)

out = codecs.open('wynik.txt', 'w', encoding='utf-8')  # write the unicode text as UTF-8
out.write('\n'.join(results))
out.close()

Output:

['6','N5GWXM','330992','3h 00m','540.00','0','226.56','9561.31','XXXXX','111-11-11-111','Jagiello','ń','ska','8','Jagiello','ń','ska','Warszawa','Jagiello','ń','ska','8','11-111','Warszawa','no-reply@xxxx.pl','123123123','WAW','ś','r. 06 lip','07:50','VIE','ś','r. 06 lip','09:15','VIE','ś','r. 06 lip','10:00','SZG','ś','r. 06 lip','10:50']

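The item 'Jagiellońska' contains the special character 'ń'. When the parser appends data to the array, the character 'ń' acts as a kind of splitting character: the text arrives as 'Jagiello', 'ń', 'ska' instead of one string. My question is why this happens. The remaining items are appended to the array correctly. Exactly the same thing happens with the item 'śr. 06 lip'.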

1 Answer:

Answer 0 (score: 1)

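The problem is that the data method of the target class may be called more than once per element. This can happen, for example, if the feeder crosses a block boundary. It looks like it also happens when a non-ASCII character is encountered. This is ancient lore; I can't find where it is documented. However, if you change your target class to something like the following, which starts a new empty string for each element and appends every data chunk to it, it will work. I have tested it on your data.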

class TitleTarget(object):
    def __init__(self):
        self.text = []
    def start(self, tag, attrib):
        self.is_title = True #if tag == 'Title' else False
        if self.is_title:
            self.text.append(u'')
    def end(self, tag):
        pass
    def data(self, data):
        if self.is_title:
            self.text[-1] += data
    def close(self):
        return self.text

To get a better grip on the output, call print repr(results) after parsing. You should now see unsplit text pieces like these:

u'Jagiello\u0144ska\n    '
u'\u015br. 06 lip\n        '
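
The same accumulate-per-element idea can be sketched for Python 3 as well. The block below is only an illustration under some assumptions: it uses the standard library's xml.etree.ElementTree.XMLParser, which accepts the same start/end/data/close target interface, instead of lxml; the names TextCollector, SAMPLE and BLOCK_SIZE are made up for this example; and it assumes a document like the one in the question, where text only appears in leaf elements. It collects the chunks passed to data() and joins them in end(), so it does not matter how many times data() is called for one element. The input is fed in small blocks on purpose, so that block boundaries fall inside text.

```python
import xml.etree.ElementTree as ET


class TextCollector:
    """Hypothetical parser target: one joined string per element that has text."""

    def __init__(self):
        self.texts = []    # finished strings, one per non-empty element
        self._chunks = []  # pieces received so far for the current element

    def start(self, tag, attrib):
        # A new element begins: start collecting from scratch.
        self._chunks = []

    def data(self, data):
        # May be called several times for a single element's text.
        self._chunks.append(data)

    def end(self, tag):
        # Join the pieces only once the element is complete.
        text = ''.join(self._chunks).strip()
        if text:
            self.texts.append(text)
        self._chunks = []

    def close(self):
        return self.texts


# Small stand-in for Flights.xml (two elements taken from the question).
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<results><FlightTravel>'
    '<InvoiceStreet>Jagiello\u0144ska</InvoiceStreet>'
    '<DepartureDate>\u015br. 06 lip</DepartureDate>'
    '</FlightTravel></results>'
).encode('utf-8')

BLOCK_SIZE = 7  # arbitrary small size so that boundaries land inside text

parser = ET.XMLParser(target=TextCollector())
for offset in range(0, len(SAMPLE), BLOCK_SIZE):
    parser.feed(SAMPLE[offset:offset + BLOCK_SIZE])
texts = parser.close()  # returns whatever the target's close() returns

print(texts)
```

Joining in end() rather than appending in data() is the design choice that matters here: the parser is free to deliver an element's text in any number of pieces, and the target only treats the text as complete when the closing tag arrives.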