Scraping text under br tags inside a class

Time: 2019-02-28 09:09:53

Tags: python-3.x beautifulsoup

I have been trying to scrape the addresses on this page: https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen=2

I am having trouble getting the addresses, which sit under br tags.

What I have tried:

addresses = page_content.select(' .cbp-vm-address')[0]
address = addresses.get_text(' ', strip=True)
address = list(addresses.stripped_strings)

This does not give me everything inside the class.

I also tried:

for br in page_content.findAll('br'):
   item = br.next_siblings
   item = list(item)
   print(item)

This gave me results like the following (excerpt): [<br/>, <br/>, <br/>, <br/>, <br/>, '\n', <a href="/solutions">DigitalSolutions</a>, '\n', <a href="https://www.yellowpages.my/deal/results.php">Deals</a>, '\n', <a class="sign-up" href="https://www.yellowpages.my/profile/add.php">Sign Up</a>, '\n', <a class="sign-up" href="https://www.yellowpages.my/profile/login.php">Login</a>, '\n']

How do I get the addresses? I am relatively new to scraping.

2 answers:

Answer 0: (score: 0)

Interesting. I actually ran into this problem too, and got around it by removing every </br> tag from the raw string before creating the soup object:

import requests
from bs4 import BeautifulSoup

raw = requests.get('https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen=2').text
raw = raw.replace("</br>", "")
soup = BeautifulSoup(raw, 'html.parser')
addresses = [x.text.strip().split("\r\n")[-1].strip() for x in soup.find_all("div", class_='cbp-vm-address')]

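That said, I don't feel this is the best solution: it preprocesses the HTML before loading it into the soup object, which doesn't strike me as best practice.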
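To see what the replacement step does without requesting the live site, here is a minimal offline sketch. The sample markup and the address in it are made up for illustration, so the real page's HTML may well differ. It also swaps the split on "\r\n" for `stripped_strings`, on the assumption that the first text fragment is the shop name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one listing; not copied from the real page.
sample = (
    '<div class="cbp-vm-address">'
    "Example Boutique<br></br>"
    "12, Jalan Contoh,<br></br>"
    "40000 Shah Alam, Selangor"
    "</div>"
)

# Drop the stray closing tags so every <br> is a plain void element.
cleaned = sample.replace("</br>", "")

soup = BeautifulSoup(cleaned, "html.parser")
block = soup.find("div", class_="cbp-vm-address")

# stripped_strings yields each text fragment found between the <br> tags.
parts = list(block.stripped_strings)

# Assumption: the first fragment is the shop name, the rest is the address.
address = " ".join(parts[1:])

print(parts)
print(address)
```

If the real listings do not start with a name fragment, `parts[1:]` would need adjusting.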

Answer 1: (score: 0)

Thanks to Ohad for the quick reply. I would just like to improve slightly on that answer:

import requests
from bs4 import BeautifulSoup

raw = requests.get("https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen=2")
# raw = raw.replace("</br>", "") # try not to do this
soup = BeautifulSoup(raw.content, "lxml") # instead, change "html.parser" to "lxml"
addresses = [x.text.strip().split("\r\n")[-1].strip() for x in soup.find_all("div", class_="cbp-vm-address")]
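Since lxml is a separate install, a small wrapper can fall back to the built-in parser when it is missing. The sketch below is only an illustration: `make_soup` and `extract_addresses` are hypothetical helper names, the sample markup is invented and already well-formed, and how each parser treats the `</br>` tags on the live page is not checked here.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is unavailable.
        return BeautifulSoup(markup, "html.parser")


def extract_addresses(markup):
    """Return one space-joined string per div with class cbp-vm-address."""
    soup = make_soup(markup)
    return [
        " ".join(div.stripped_strings)
        for div in soup.find_all("div", class_="cbp-vm-address")
    ]


# Hypothetical markup; the addresses are made up for illustration.
sample = """
<div class="cbp-vm-address">12, Jalan Contoh,<br>40000 Shah Alam, Selangor</div>
<div class="cbp-vm-address">3A, Lorong Ujian,<br>47500 Subang Jaya, Selangor</div>
"""

addresses = extract_addresses(sample)
for address in addresses:
    print(address)
```

On the real page, `extract_addresses(raw.content)` would be the equivalent call, though the result depends on the markup the site actually returns.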