Question

Please help me write an XPath expression.

HTML:

<div class="TabItem">
    <p><strong>Product Composition</strong></p>
    <p>93% Polyamide 7% Elastane</p>
    <p>Lining: 100% Polyester</p><p>Dress Length: 90 cm</p>

    <p><strong>Product Attributes;</strong></p>
    <p>: Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side</p>
    <p>Lining Type: Full Lining</p>
</div>

From this HTML I need to build the following dictionary:

data['Product Composition'] = '93% Polyamide 7% Elastane Lining: 100% Polyester Dress Length: 90 cm'
data['Product Attributes;'] = ': Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side Lining Type: Full Lining'

Importantly, the number of elements can vary, so I need a general solution.

Answer 1

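Find every strong tag inside a p, take its parent p, then walk the parent's following siblings until you reach another p that contains a strong tag or run out of siblings: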

from lxml.html import fromstring


html_data = """<div class="TabItem">
    <p><strong>Product Composition</strong></p>
    <p>93% Polyamide 7% Elastane</p>
    <p>Lining: 100% Polyester</p><p>Dress Length: 90 cm</p>

    <p><strong>Product Attributes;</strong></p>
    <p>: Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side</p>
    <p>Lining Type: Full Lining</p>
</div>"""

tree = fromstring(html_data)
data = {}
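for strong in tree.xpath('//p/strong'):
    # The heading lives in <p><strong>...</strong></p>; start from that <p>.
    parent = strong.getparent()

    description = []
    next_p = parent.getnext()
    # Collect siblings until the next heading <p> or the end of the container.
    while next_p is not None and not next_p.xpath('.//strong'):
        # text_content() also covers paragraphs whose text sits in child tags.
        description.append(next_p.text_content().strip())
        next_p = next_p.getnext()

    data[strong.text] = " ".join(description)

print(data)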

Prints:

{'Product Composition': '93% Polyamide 7% Elastane Lining: 100% Polyester Dress Length: 90 cm', 
 'Product Attributes;': ': Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side Lining Type: Full Lining'}
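
If lxml is not available, the same grouping can probably be done in one pass with the standard library. This is only a sketch: it assumes the snippet is well-formed enough to parse as XML (real-world HTML often is not) and that every heading is a direct child p of the container with a strong inside. The function name group_by_heading is made up for illustration.

```python
import xml.etree.ElementTree as ET

html_data = """<div class="TabItem">
    <p><strong>Product Composition</strong></p>
    <p>93% Polyamide 7% Elastane</p>
    <p>Lining: 100% Polyester</p><p>Dress Length: 90 cm</p>

    <p><strong>Product Attributes;</strong></p>
    <p>: Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side</p>
    <p>Lining Type: Full Lining</p>
</div>"""


def group_by_heading(markup):
    """Map each heading <p><strong>..</strong></p> to the text of the <p> tags after it.

    Hypothetical helper; assumes the markup parses as XML.
    """
    root = ET.fromstring(markup)
    grouped = {}
    current = None  # heading that the following paragraphs belong to
    for p in root.findall("p"):  # direct <p> children, in document order
        text = "".join(p.itertext()).strip()
        if p.find("strong") is not None:
            # A new heading starts a new group.
            current = text
            grouped[current] = []
        elif current is not None and text:
            grouped[current].append(text)
    return {heading: " ".join(parts) for heading, parts in grouped.items()}


data = group_by_heading(html_data)
print(data)
```

Because it walks the children once and only switches groups when it sees a heading, it does not care how many paragraphs follow each heading.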
