Python SAX Parser

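Date: 2016-05-19 22:18:15

Tags: python xml parsing sax

Please help. I am trying to parse a large XML file and transfer the data into a CSV file. I keep losing large amounts of the data between the tags and cannot figure out why.

Here is some of the XML: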

<testcase internalid="1256092" name="hls_vtt_single_default_diable_vtt">
    <node_order><![CDATA[7]]></node_order>
    <externalid><![CDATA[6121]]></externalid>
    <version><![CDATA[2]]></version>
    <summary><![CDATA[<p>condition: single subtitle track is available in stream and it is default  &nbsp;set the vtt track to diable status before playing stream.</p>
<p>&nbsp;</p>
<div>play stream  no subtitle is rendered along with A/V<span class="Apple-tab-span" style="white-space:pre">   </span></div>
<div>&nbsp;</div>]]></summary>
    <preconditions><![CDATA[]]></preconditions>
    <execution_type><![CDATA[1]]></execution_type>
    <importance><![CDATA[2]]></importance>
</testcase>

Here is my Python code:

import xml.sax

# outfile is a file object opened for writing elsewhere (not shown).

class CaseHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.CurrentData = ""
        self.externalid = ""
        self.version = ""
        self.summary = ""

    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "testcase":
            name = attributes["name"]
            outfile.write("\n" + name + " ,")

    def endElement(self, tag):
        if self.CurrentData == "externalid":
            outfile.write("OTV52-" + self.externalid + ",")

        elif self.CurrentData == "version":
            outfile.write("Version:  " + self.version + ",")

        elif self.CurrentData == "summary":
            outfile.write("Summary:  " + self.summary + ",")

    def characters(self, content):
        if self.CurrentData == "externalid":
            self.externalid = content
        elif self.CurrentData == "version":
            self.version = content
        elif self.CurrentData == "summary":
            self.summary = content

if __name__ == "__main__":

    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    Handler = CaseHandler()
    parser.setContentHandler(Handler)

    parser.parse("OTV52.xml")

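The problem is that it does not return any of the information inside the "summary" tags. The externalid and version data are fine, but all that comes back from the "summary" element is the div markup.

I need it to return:

"condition: single subtitle track is available in stream and it is default set the vtt track to diable status before playing stream. play stream no subtitle is rendered along with A/V"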

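1 answer:

Answer 0 (score: 0)

As this answer shows, you should append to the parsed values with +=content on each characters() call. A SAX parser may deliver one element's text in several chunks, so plain assignment keeps only the last chunk, which would explain why only the final div came through. To also strip the markup left in the parsed CDATA (HTML tags, &nbsp; entities and line breaks), consider regex replacements: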

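import re
import xml.sax

# outfile is assumed to be a file object already opened for writing,
# as in the question's code.

class CaseHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.CurrentData = ""
        self.externalid = ""
        self.version = ""
        self.summary = ""

    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "testcase":
            name = attributes["name"]
            outfile.write("\n" + name + " ,")
        elif tag == "externalid":
            # Start each field empty so text from the previous
            # testcase does not carry over.
            self.externalid = ""
        elif tag == "version":
            self.version = ""
        elif tag == "summary":
            self.summary = ""

    def endElement(self, tag):
        if tag == "externalid":
            outfile.write("OTV52-" + self.externalid + ",")
        elif tag == "version":
            outfile.write("Version:  " + self.version + ",")
        elif tag == "summary":
            # Drop the HTML tags, then line breaks and &nbsp; entities.
            self.summary = re.sub(r"<[^>]+>", "", self.summary)
            self.summary = re.sub(r"\n|&nbsp;", "", self.summary).strip()
            outfile.write("Summary:  " + self.summary + ",")
        # Ignore the whitespace between elements.
        self.CurrentData = ""

    def characters(self, content):
        # The parser may deliver one element's text in several chunks,
        # so append instead of assigning.
        if self.CurrentData == "externalid":
            self.externalid += content
        elif self.CurrentData == "version":
            self.version += content
        elif self.CurrentData == "summary":
            self.summary += content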

Output (all on one line; wrapped here for readability):

# 
# hls_vtt_single_default_diable_vtt ,OTV52-6121,Version:  2,Summary:          \
#          condition: single subtitle track is available in stream and it is  \
#          default  set the vtt track to diable status before playing         \
#          stream.play stream  no subtitle is rendered along with A/V,        \
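
To see why appending matters, the chunking can be observed in isolation. The following is only a minimal sketch: ChunkRecorder, record_chunks and SAMPLE are hypothetical names made up for this illustration, and the sample XML is a shortened stand-in for the real file. How many chunks the parser produces depends on the underlying parser, so the sketch prints the count rather than assuming it.

```python
import xml.sax


class ChunkRecorder(xml.sax.ContentHandler):
    """Record every characters() call made inside <summary>."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_summary = False
        self.chunks = []

    def startElement(self, tag, attributes):
        if tag == "summary":
            self.in_summary = True

    def endElement(self, tag):
        if tag == "summary":
            self.in_summary = False

    def characters(self, content):
        # Keep each chunk separately so the individual calls stay visible.
        if self.in_summary:
            self.chunks.append(content)


# Shortened, made-up sample with a multi-line CDATA section.
SAMPLE = (
    "<testcase><summary><![CDATA[<p>first line</p>\n"
    "<div>second line</div>\n"
    "<div>third line</div>]]></summary></testcase>"
)


def record_chunks(xml_text):
    """Parse xml_text and return the list of chunks seen in <summary>."""
    handler = ChunkRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.chunks


if __name__ == "__main__":
    chunks = record_chunks(SAMPLE)
    print(len(chunks), "call(s) to characters()")
    print(repr(chunks[-1]))        # all that plain assignment would keep
    print(repr("".join(chunks)))   # what appending (or joining) keeps
```

If the count printed is greater than one, plain assignment in characters() would have kept only the last chunk, while joining the chunks always gives back the full CDATA text.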