Question

I am using Python + XPath to parse some HTML, but I am having trouble parsing a definition list. An example:

<dl> <dt>Section One</dt> <dd>Child one</dd> <dd>Child one.2</dd> <dt>Section Two</dt> <dd>Child two</dd> </dl>

I would like to convert it into output like this:
{'Section One' : ['Child one','Child one.2'], 'Section Two' : ['Child two']}

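I am having a hard time coming up with a solution because the dt and dd elements are all siblings, so the markup does not contain the hierarchy I want in the output.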


Answer 1

A solution without XPath, using lxml (which you are probably already using if you are working with XPath?):

from collections import defaultdict
from lxml import etree

dl = etree.fromstring('''<dl>
<dt>Section One</dt>
<dd>Child one</dd>
<dd>Child one.2</dd>
<dt>Section Two</dt>
<dd>Child two</dd>
</dl>''')

result = defaultdict(list)
for dt in dl.findall('dt'):
    for child in dt.itersiblings(): # iterate over following siblings
        if child.tag != 'dd':
            break # stop at the first element that is not a dd
        result[dt.text].append(child.text)

print(dict(result))

(Any XPath solution I can think of is uglier than this.)
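For comparison, here is a rough sketch of what an XPath-driven variant of the loop above might look like with lxml. It assumes the dt elements are direct children of the dl, and it selects the dd elements by counting how many dt siblings precede them. The function name parse_dl_xpath and the SAMPLE constant are made up for illustration; the XPath variable $i is passed through lxml's keyword-argument mechanism.

```python
from lxml import etree

SAMPLE = '''<dl>
<dt>Section One</dt>
<dd>Child one</dd>
<dd>Child one.2</dd>
<dt>Section Two</dt>
<dd>Child two</dd>
</dl>'''


def parse_dl_xpath(dl):
    """Map each dt's text to the texts of the dd elements that belong to it."""
    result = {}
    for i, dt in enumerate(dl.xpath('dt'), start=1):
        # A dd belongs to the i-th dt when exactly i dt siblings precede it.
        dds = dl.xpath('dd[count(preceding-sibling::dt) = $i]', i=i)
        result[dt.text] = [dd.text for dd in dds]
    return result


print(parse_dl_xpath(etree.fromstring(SAMPLE)))
```

Note that two dt elements with the same text would overwrite each other here, and a dt with no dd ends up with an empty list, whereas the defaultdict loop leaves such a dt out entirely.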

Answer 2

A single-expression XPath 1.0 solution (if one is possible at all) would be difficult to write and to understand.

Here is a simple XSLT 1.0 solution:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>

 <xsl:key name="kFollowing" match="dd"
      use="generate-id(preceding-sibling::dt[1])"/>

 <xsl:template match="dl">
  { <xsl:apply-templates select="dt"/> }
 </xsl:template>

 <xsl:template match="dt">
  <xsl:text/>'<xsl:value-of select="."/>' : [ <xsl:text/>
   <xsl:apply-templates select=
       "key('kFollowing', generate-id())"/>
   <xsl:text> ]</xsl:text>
   <xsl:if test="not(position()=last())">, </xsl:if>
 </xsl:template>

 <xsl:template match="dd">
  <xsl:text/>'<xsl:value-of select="."/>'<xsl:text/>
   <xsl:if test="not(position()=last())">, </xsl:if>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<dl>
    <dt>Section One</dt>
    <dd>Child one</dd>
    <dd>Child one.2</dd>
    <dt>Section Two</dt>
    <dd>Child two</dd>
</dl>

the wanted, correct result is produced:

  { 'Section One' : [ 'Child one', 'Child one.2' ], 'Section Two' : [ 'Child two' ] }

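Explanation: An xsl:key is defined and used to capture the one-to-many relationship between a dt and the sibling dd elements that immediately follow it. Each dd is indexed by the generate-id() of its nearest preceding dt sibling, so key('kFollowing', generate-id()) called on a dt returns exactly the dd elements that belong to it.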
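Since the question is about Python, the same grouping idea (each dd attached to its nearest preceding dt) can be sketched without XSLT. This is only an illustration using the standard library's xml.etree.ElementTree, not a translation of the stylesheet; the function name group_by_preceding_dt is hypothetical, and the sketch assumes well-formed markup in which dt and dd are direct children of dl.

```python
import xml.etree.ElementTree as ET


def group_by_preceding_dt(dl):
    """Attach every dd to the closest dt that precedes it in document order."""
    result = {}
    current = None  # list of dd texts for the most recent dt seen so far
    for child in dl:
        if child.tag == 'dt':
            current = result.setdefault(child.text, [])
        elif child.tag == 'dd' and current is not None:
            current.append(child.text)
    return result


dl = ET.fromstring(
    '<dl><dt>Section One</dt><dd>Child one</dd><dd>Child one.2</dd>'
    '<dt>Section Two</dt><dd>Child two</dd></dl>'
)
print(group_by_preceding_dt(dl))
```

A single pass over the children plays the role of the key lookup: tracking the most recent dt is equivalent to evaluating preceding-sibling::dt[1] for each dd. A dd that appears before any dt is skipped.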

What is the best way to parse a definition list using XPath?
