Beautiful Soup: not grabbing the right information

Time: 2015-12-10 23:52:50

Tags: python django beautifulsoup

I'm using Beautiful Soup to scrape the flower names in bold and the corresponding image links from this page: http://www.all-my-favourite-flower-names.com/list-of-flower-names.html

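I want to do this not only for the flowers that start with "A", but to make the scraper work for all the other flowers as well (those starting with "B", "C", "D", and so on).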

I was able to hack something together for some of the "A" flowers:

# Find the bold flower names and append them to the flowers list
for flower in soup.find_all('b'):
    flower = flower.string
    if flower is not None and flower.startswith("A"):
        flowers.append(flower.strip('.()'))

# Collect the 'src' of every <img> tag, dropping a leading 'https://'
# (str.strip('https://') would remove those characters from both ends instead)
for link in soup.find_all('img'):
    src = link['src']
    if src.startswith('https://'):
        src = src[len('https://'):]
    links.append(src)

# Find the one flower name that doesn't follow the pattern of the others
# and insert it into the flowers list
for straggler in soup.find_all('a'):
    floss = straggler.string
    if floss == "Ageratum houstonianum.":
        flowers.insert(3, floss)

One obvious problem with this is that it is bound to break as soon as anything on the page changes. Could someone give me a hand?

1 Answer:

Answer 0: (score: 1)

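The problem seems to be that the flowers are paginated across several pages. Something like this should help you iterate over the different pages. The code is untested.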

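from urllib.request import urlopen  # on Python 2: from urllib2 import urlopen

from bs4 import BeautifulSoup

# Page key -> URL suffix for that page (extend with the remaining pages)
test = {'A': '', 'B': '-B', 'XYZ': '-X-Y-Z'}
flower_list = []
for key, value in test.items():
    url = 'http://www.all-my-favourite-flower-names.com/list-of-flower-names{0}.html'.format(value)
    page = urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')
    # Now do your logic for every page, and probably save the flower names in flower_list.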
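
To fill in the per-page logic, here is a minimal sketch of an extraction step that does not depend on the first letter of the name. It assumes that every `<b>` element on a page holds a flower name and that every `<img>` with a `src` is a flower picture, which may not hold on the real site. The names `extract_flowers`, `BASE_URL`, and `SAMPLE_HTML` are made up for illustration, and the sample markup is invented rather than copied from the site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base address, used only to resolve relative image paths.
BASE_URL = "http://www.all-my-favourite-flower-names.com/"


def extract_flowers(html, base_url=BASE_URL):
    """Return (names, image_urls) found in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")

    names = []
    for bold in soup.find_all("b"):
        # get_text() still works when the <b> has several children,
        # a case where .string would be None.
        text = " ".join(bold.get_text().split()).strip(".() ")
        if text:
            names.append(text)

    # Skip <img> tags without a src and turn relative paths into full URLs.
    image_urls = [urljoin(base_url, img["src"])
                  for img in soup.find_all("img", src=True)]
    return names, image_urls


# Invented markup that imitates the kinds of entries described in the question.
SAMPLE_HTML = """
<p><b>Abutilon.</b> <img src="images/abutilon.jpg"></p>
<p><b>(Acacia)</b> <img src="https://example.com/acacia.jpg"></p>
<p><b><a href="#">Ageratum</a> houstonianum.</b></p>
<p><b> </b><img alt="no source"></p>
"""

names, images = extract_flowers(SAMPLE_HTML)
print(names)
print(images)
```

Inside the loop over pages, something like `names, images = extract_flowers(page)` followed by `flower_list.extend(names)` would collect the names from every page. Because the name with the link inside it is picked up by the same rule, the separate special case for "Ageratum houstonianum." should no longer be needed, provided the real page wraps that name in a `<b>` tag as assumed here.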