Parsing div children with Beautiful Soup

Date: 2016-02-02 00:27:00

Tags: python html python-3.x beautifulsoup html-parsing

I'm using Beautiful Soup to find and parse street addresses on a page. Ultimately, I want to write the street addresses to an Excel document.

Here is the page I'm trying to parse: https://montreal.lufa.com/en/pick-up-points

The page in question lists the div elements at the same level under the class. I haven't been able to parse the individual rows; instead, my code just spits out everything inside the class.

My code so far:

from bs4 import BeautifulSoup
from urllib2 import urlopen
import urllib2

URL = "https://montreal.lufa.com/en/pick-up-points"
html = urllib2.urlopen(URL).read().decode('UTF-8')

soup = BeautifulSoup(html, "html5lib")

business = (soup.find('div', class_="info"))

print (business)

Any help is greatly appreciated!

2 answers:

Answer 0 (score: 1):

I would do the following: for each business, find the "days" block and get every previous sibling.

for business in soup.find_all('div', class_="info"):
    # the "days" div comes after the name/address lines inside each business card
    days = business.find("div", class_="days")

    # collect every sibling before it, reversed to restore document order
    print(" ".join(sibling.get_text(strip=True)
                   for sibling in reversed(days.find_previous_siblings())))

This prints:

1600, René-Lévesque west 1600, René-Lévesque west Montreal, Quebec H3H 1P9
555 Chabanel Street West 555 Chabanel Street West Montreal, Quebec H2N 2H8
À la Boîte à Fleurs 3266 Saint-Rose Boulevard Laval, Quebec H7P 4K8
Allez Up Centre d'escalade 1555 St-Patrick Montreal, Quebec H3K 2B7
...
YMCA Cartierville 11885 Laurentien Boulevard Montreal, Quebec H4J 2R5
Zone, Real estate Agency 200 rue St-Jean Longueuil, Quebec J4H 2X5
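
Since the end goal was an Excel document, here is a minimal sketch of that last step. It assumes the openpyxl package (not part of the original question; any spreadsheet or CSV library would do) and reuses the same sibling walk as above; the output file name is just an example.

from bs4 import BeautifulSoup
from openpyxl import Workbook
import urllib2

URL = "https://montreal.lufa.com/en/pick-up-points"
soup = BeautifulSoup(urllib2.urlopen(URL).read().decode('UTF-8'), "lxml")

wb = Workbook()
ws = wb.active
ws.append(["Pick-up point address"])  # header row

for business in soup.find_all('div', class_="info"):
    days = business.find("div", class_="days")
    # same sibling walk as above, joined into a single cell
    ws.append([" ".join(sibling.get_text(strip=True)
                        for sibling in reversed(days.find_previous_siblings()))])

wb.save("pickup_points.xlsx")  # example file name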

Answer 1 (score: 1):

Cool, alecxe! Here's how I got it working on my machine...

#1)  In Console:  
pip install lxml


#2)  Run script below:
from bs4 import BeautifulSoup
import urllib2

URL = "https://montreal.lufa.com/en/pick-up-points"
html = urllib2.urlopen(URL).read().decode('UTF-8')

# parse with lxml (installed in step 1)
soup = BeautifulSoup(html, "lxml")

for business in soup.find_all('div', class_="info"):
    # the "days" div marks the end of the address block in each business card
    days = business.find("div", class_="days")

    # everything before it is the name/address; reverse to restore document order
    print(" ".join(sibling.get_text(strip=True)
                   for sibling in reversed(days.find_previous_siblings())))
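
Note that the question is tagged python-3.x, while urllib2 only exists in Python 2. As a rough sketch, the same fetch can be done on Python 3 with the standard-library urllib.request; the rest of the code is unchanged.

# Python 3 equivalent of the fetch above (urllib2 was split into urllib.request)
from urllib.request import urlopen
from bs4 import BeautifulSoup

URL = "https://montreal.lufa.com/en/pick-up-points"
html = urlopen(URL).read().decode('UTF-8')
soup = BeautifulSoup(html, "lxml")

for business in soup.find_all('div', class_="info"):
    days = business.find("div", class_="days")
    print(" ".join(sibling.get_text(strip=True)
                   for sibling in reversed(days.find_previous_siblings())))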