Question

I'm having trouble using BeautifulSoup to parse an HTML div element by its class attribute. The code looks like this:

I need to extract the content of the h4 tag. Because it is a random value, I can't search for the text "Ocarrol".

find('div',{"class": "carResultRow_OfferInfo_Supplier-wrap "})

When I run this query, it returns None.

<div class="carResultRow_OfferInfo_Supplier-wrap ">
<h3 class="carResultRow_OfferInfo_SupplierLabel">Servicio proporcionado por:</h3>
<img src="https://cdn2.rcstatic.com/images/suppliers/flat/ocarrol_logo.gif" title="Ocarrol" alt="Ocarrol">
<h4 style="" xpath="1">Ocarrol</h4>
<a href="InfoPo=0&amp;driversAge=30&amp;os=1" onclick="GAQPush('cboxElement">Términos y condiciones</a>
</div>

link

I've added the link. In this case I only need the names of the car rental companies, for example Ocarrol, Hertz, Fit Car Rental, etc.

Answer 1

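I believe you are using BeautifulSoup 4.7+. Some attributes are handled a little specially in Beautiful Soup, and in 4.7 the end result is slightly different from what it was in <= 4.6.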

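Attributes that are normally treated as whitespace-separated lists are handled differently from all other attributes, and class happens to be one of them. Instead of storing such an attribute exactly as it appears in the HTML document, BeautifulSoup stores it as a list of classes with the whitespace removed: "class1 class2 " -> ['class1', 'class2']. When the class attribute has to be evaluated as a string, it rejoins the values with a single space between each one, so things like trailing whitespace are no longer present: "class1 class2".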
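The splitting and rejoining described above can be observed directly. This is a minimal sketch, assuming BeautifulSoup 4.7 or later and the built-in html.parser; the markup and class names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: extra spaces between the classes and a trailing space.
soup = BeautifulSoup('<div class="class1   class2 "></div>', "html.parser")
div = soup.div

# The class attribute is stored as a list, not as the raw string.
print(div["class"])

# When serialized, the values are rejoined with single spaces.
print(str(div))
```

On 4.7+ the first print should show the two class names as a list, and the second should show the attribute without the extra or trailing whitespace.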

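Now, I'm not arguing that this is intuitive, only that it is what BeautifulSoup does. Personally I would prefer that BeautifulSoup stored these attributes as the raw string and split them into a list when needed, but that is not how it works.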

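In BeautifulSoup <= 4.6, I believe trailing whitespace was preserved, but there were a number of other quirks. With 4.7+, you should simply assume that leading and trailing whitespace is ignored and that whitespace between classes is collapsed to a single space. So in your case, just leave off the trailing space: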

soup.find('div',{"class": "carResultRow_OfferInfo_Supplier-wrap"})

You can read more about this behavior here: https://bugs.launchpad.net/beautifulsoup/+bug/1824502.
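To make the contrast concrete, here is a small sketch that runs the same lookup with and without the trailing space. It assumes BeautifulSoup 4.7+ and uses a shortened version of the markup from the question; the class_ keyword form is shown only as an equivalent spelling of the dictionary form.

```python
from bs4 import BeautifulSoup

# Shortened markup from the question; note the trailing space in the class.
html = '<div class="carResultRow_OfferInfo_Supplier-wrap "><h4>Ocarrol</h4></div>'
soup = BeautifulSoup(html, "html.parser")

# Query that keeps the trailing space: no stored class value matches it.
with_space = soup.find("div", {"class": "carResultRow_OfferInfo_Supplier-wrap "})

# Query without the trailing space: matches the stored class value.
without_space = soup.find("div", {"class": "carResultRow_OfferInfo_Supplier-wrap"})

# Equivalent lookup using the class_ keyword argument.
by_keyword = soup.find("div", class_="carResultRow_OfferInfo_Supplier-wrap")

print(with_space)                     # expected to be None on 4.7+
print(without_space.h4.get_text())    # text of the h4 inside the div
print(by_keyword is without_space)    # both lookups find the same tag
```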

Example

from bs4 import BeautifulSoup

html = """
<div class="carResultRow_OfferInfo_Supplier-wrap ">
<h3 class="carResultRow_OfferInfo_SupplierLabel">Servicio proporcionado por:</h3>
<img src="https://cdn2.rcstatic.com/images/suppliers/flat/ocarrol_logo.gif" title="Ocarrol" alt="Ocarrol">
<h4 style="" xpath="1">Ocarrol</h4>
<a href="InfoPo=0&amp;driversAge=30&amp;os=1" onclick="GAQPush('cboxElement">Términos y condiciones</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

print(soup.find('div',{"class": "carResultRow_OfferInfo_Supplier-wrap"}).find('h4'))

Output

<h4 style="" xpath="1">Ocarrol</h4>
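Since the question asks for only the supplier names across several results, the same lookup can be extended with find_all. This is a sketch under assumptions: the page is assumed to repeat the same wrapper div once per offer, and the three-block markup below is invented for illustration using the company names mentioned in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one wrapper div per offer, as assumed above.
html = """
<div class="carResultRow_OfferInfo_Supplier-wrap "><h4>Ocarrol</h4></div>
<div class="carResultRow_OfferInfo_Supplier-wrap "><h4>Hertz</h4></div>
<div class="carResultRow_OfferInfo_Supplier-wrap "><h4>Fit Car Rental</h4></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find every wrapper div, searching without the trailing space.
wrappers = soup.find_all("div", class_="carResultRow_OfferInfo_Supplier-wrap")

# Take the text of each h4, skipping wrappers that have no h4 child.
suppliers = [w.h4.get_text(strip=True) for w in wrappers if w.h4 is not None]
print(suppliers)
```

Using get_text(strip=True) returns just the name rather than the whole h4 tag, and the `w.h4 is not None` guard avoids an AttributeError if a block is missing its h4.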

Answer 2

Maybe you could use a CSS selector instead of find?

from bs4 import BeautifulSoup

html = '''<div class="carResultRow_OfferInfo_Supplier-wrap ">
<h3 class="carResultRow_OfferInfo_SupplierLabel">Servicio proporcionado por:</h3>
<img src="https://cdn2.rcstatic.com/images/suppliers/flat/ocarrol_logo.gif" title="Ocarrol" alt="Ocarrol">
<h4 style="" xpath="1">Ocarrol</h4>
<a href="InfoPo=0&amp;driversAge=30&amp;os=1" onclick="GAQPush('cboxElement">Términos y condiciones</a>
</div>'''
soup = BeautifulSoup(html, 'lxml')

print(soup.select('div[class="carResultRow_OfferInfo_Supplier-wrap"]'))

Prints:

[<div class="carResultRow_OfferInfo_Supplier-wrap">
<h3 class="carResultRow_OfferInfo_SupplierLabel">Servicio proporcionado por:</h3>
<img alt="Ocarrol" src="https://cdn2.rcstatic.com/images/suppliers/flat/ocarrol_logo.gif" title="Ocarrol"/>
<h4 style="" xpath="1">Ocarrol</h4>
<a href="InfoPo=0&amp;driversAge=30&amp;os=1" onclick="GAQPush('cboxElement">Términos y condiciones</a>
</div>]
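If only the h4 text is needed, a CSS class selector with a child combinator can reach it in one step. This is a sketch rather than part of the original answer: it assumes a BeautifulSoup version that provides select_one, and it uses html.parser instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened markup from the question, including the trailing space.
html = '''<div class="carResultRow_OfferInfo_Supplier-wrap ">
<h3 class="carResultRow_OfferInfo_SupplierLabel">Servicio proporcionado por:</h3>
<h4 style="" xpath="1">Ocarrol</h4>
</div>'''
soup = BeautifulSoup(html, "html.parser")

# ".name" matches by class membership, so whitespace in the attribute
# does not matter; "> h4" selects the h4 that is a direct child.
h4 = soup.select_one("div.carResultRow_OfferInfo_Supplier-wrap > h4")

# select_one returns None when nothing matches, so guard before reading text.
name = h4.get_text(strip=True) if h4 is not None else None
print(name)
```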

Tag h4 contained in a div class
