How can I elegantly parse HTML into a dictionary?

Question

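I'm trying to parse HTML into a dictionary.

My current code contains a lot of logic.

It smells bad. I'm using lxml to help me parse it. Is there a recommended way to parse HTML like this, where the DOM is not well structured?

Thanks a lot.

Raw HTML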

<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>
<p><strong>Flight duration:</strong> 3h 45m</p>
<p><strong>Operated by:</strong> NokScoot</p>

Expected result

{
    Departs: "5:15:00AM, Sat, Nov 28, 2015",
    Arrives: "8:00:00AM, Sat, Nov 28, 2015",
    Flight duration: "3h 45m",
    ...
}

Current code (implementation)

from lxml import html

# resp, has_stops and set_trace are defined elsewhere (not shown).
doc_root = html.document_fromstring(resp.text)
for ele in doc_root.xpath('//ul[@class="tb_body"]'):
  # Skip itineraries that have stops.
  if has_stops(ele.xpath('.//li[@class="tb_body_flight"]//span[@class="has_cuspopup"]')):
    continue
  set_trace()
  from_city = ele.xpath('.//li[@class="tb_body_city"]')[0]
  set_trace()
  sub_ele = ele.xpath('.//li[@class="tb_body_flight"]//span[@class="has_cuspopup"]')
  set_trace()

Answer 1

I put together an example for the HTML you provided. It uses the popular Beautiful Soup library.

from bs4 import BeautifulSoup


data = '<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>\
        <p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>\
        <p><strong>Flight duration:</strong> 3h 45m</p>\
        <p><strong>Operated by:</strong> NokScoot</p>'

soup = BeautifulSoup(data, 'html.parser')
res = {p.contents[0].text: p.contents[1].split(' - ')[0].strip() for p in soup.find_all('p')}
print(res)

Output:

{
    'Departs:': '5:15:00AM, Sat, Nov 28, 2015', 
    'Flight duration:': '3h 45m', 
    'Operated by:': 'NokScoot', 
    'Arrives:': '8:00:00AM, Sat, Nov 28, 2015'
}
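Note that the keys above keep the trailing colon, while the expected result in the question does not. A possible variation that strips it is sketched below. The function name parse_flight_details is made up for illustration, and the sketch assumes every paragraph follows the `<p><strong>Label:</strong> value</p>` pattern from the question.

```python
from bs4 import BeautifulSoup

data = """
<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>
<p><strong>Flight duration:</strong> 3h 45m</p>
<p><strong>Operated by:</strong> NokScoot</p>
"""


def parse_flight_details(markup):
    """Hypothetical helper: map each <strong> label to the text that follows it."""
    soup = BeautifulSoup(markup, 'html.parser')
    result = {}
    for p in soup.find_all('p'):
        label = p.find('strong')
        if label is None:
            # Paragraph does not follow the assumed pattern; skip it.
            continue
        # "Departs:" -> "Departs"
        key = label.get_text(strip=True).rstrip(':')
        # The value is the text node right after <strong>;
        # keep only the part before the first " - " (drops the city).
        value = str(label.next_sibling or '').split(' - ')[0].strip()
        result[key] = value
    return result


print(parse_flight_details(data))
```

Paragraphs without a `<strong>` child are skipped rather than raising, which may or may not be what you want on the real page.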

I think that if you want the code to be compact, you should avoid relying on attributes.
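Since the question already uses lxml, the same idea can probably be expressed there too. The sketch below selects paragraphs by structure (a `<p>` with a `<strong>` child) instead of class attributes. The names RAW_HTML and parse_flight_info are made up for illustration, and it assumes the fragment looks like the raw HTML in the question.

```python
from lxml import html

RAW_HTML = """
<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>
<p><strong>Flight duration:</strong> 3h 45m</p>
<p><strong>Operated by:</strong> NokScoot</p>
"""


def parse_flight_info(fragment):
    """Hypothetical helper: build a dict from <p><strong>Label:</strong> value</p> rows."""
    root = html.document_fromstring(fragment)
    result = {}
    # Match only paragraphs that have a <strong> child; no class attributes needed.
    for p in root.xpath('//p[strong]'):
        strong = p.find('strong')
        # Label text without the trailing colon.
        key = strong.text_content().strip().rstrip(':')
        # In lxml, the text after the closing </strong> is the element's tail.
        value = (strong.tail or '').split(' - ')[0].strip()
        result[key] = value
    return result


print(parse_flight_info(RAW_HTML))
```

On the real page you would likely run the XPath against the relevant `ul` element rather than the whole document, using a relative expression such as `.//p[strong]`.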
