How do I select this data into a dictionary?

Asked: 2014-03-10 17:18:03

Tags: python html xpath html-parsing lxml

Please help me write an XPath expression.

HTML:

<div class="TabItem">
    <p><strong>Product Composition</strong></p>
    <p>93% Polyamide 7% Elastane</p>
    <p>Lining: 100% Polyester</p><p>Dress Length: 90 cm</p>

    <p><strong>Product Attributes;</strong></p>
    <p>: Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side</p>
    <p>Lining Type: Full Lining</p>
</div>

From this HTML I need to build the following dictionary:

data['Product Composition'] = '93% Polyamide 7% Elastane Lining: 100% Polyester Dress Length: 90 cm'
data['Product Attributes;'] = ': Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side Lining Type: Full Lining'

The important point is that the number of elements can vary, so I need a general solution.

1 Answer:

Answer 0 (score: 1)

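Get every strong tag that sits inside a p tag, then take its parent p and walk through that parent's following siblings until you reach another p tag with a strong tag inside, or until there are no siblings left: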

from lxml.html import fromstring


html_data = """<div class="TabItem">
    <p><strong>Product Composition</strong></p>
    <p>93% Polyamide 7% Elastane</p>
    <p>Lining: 100% Polyester</p><p>Dress Length: 90 cm</p>

    <p><strong>Product Attributes;</strong></p>
    <p>: Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side</p>
    <p>Lining Type: Full Lining</p>
</div>"""

tree = fromstring(html_data)
data = {}
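for strong in tree.xpath('//p/strong'):
    # The <p> that wraps the <strong> heading
    parent = strong.getparent()

    description = []
    next_p = parent.getnext()
    # Collect sibling paragraphs until the next heading or the end of the siblings
    while next_p is not None and not next_p.xpath('.//strong'):
        # text_content() returns '' for an empty <p>, whereas .text would be None
        description.append(next_p.text_content().strip())
        next_p = next_p.getnext()

    data[strong.text] = " ".join(description)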

print(data)

Prints:

{'Product Composition': '93% Polyamide 7% Elastane Lining: 100% Polyester Dress Length: 90 cm',
 'Product Attributes;': ': Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side Lining Type: Full Lining'}
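
Since the question asked for an XPath expression, here is a rough sketch of doing the grouping in XPath itself instead of walking the siblings in Python. It assumes every heading is a p with a direct strong child and that the body paragraphs are siblings of that heading, as in the sample HTML. The names `sections_to_dict` and `HTML_DATA` are made up for this illustration, and `$n` is an XPath variable passed to lxml's `xpath()` as a keyword argument.

```python
from lxml.html import fromstring

HTML_DATA = """<div class="TabItem">
    <p><strong>Product Composition</strong></p>
    <p>93% Polyamide 7% Elastane</p>
    <p>Lining: 100% Polyester</p><p>Dress Length: 90 cm</p>

    <p><strong>Product Attributes;</strong></p>
    <p>: Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side</p>
    <p>Lining Type: Full Lining</p>
</div>"""


def sections_to_dict(html_text):
    """Map each <p><strong>heading</strong></p> to the text of the <p> siblings below it."""
    tree = fromstring(html_text)
    result = {}
    for heading in tree.xpath('//p[strong]'):
        # Position of this heading among its sibling headings (1-based).
        n = heading.xpath('count(preceding-sibling::p[strong])') + 1
        # A body paragraph belongs to this heading when exactly n heading
        # paragraphs come before it among its siblings.
        body = heading.xpath(
            'following-sibling::p[not(strong)]'
            '[count(preceding-sibling::p[strong]) = $n]',
            n=n,
        )
        key = heading.text_content().strip()
        result[key] = " ".join(p.text_content().strip() for p in body)
    return result


data = sections_to_dict(HTML_DATA)
print(data)
```

Counting the preceding headings is a common XPath 1.0 way to group siblings, so it copes with any number of sections and any number of paragraphs per section. The Python loop in the answer is probably easier to read; this variant is mainly useful if you want the selection logic to live in the expression.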