Python BeautifulSoup web scraping: different elements in a loop

Date: 2018-11-11 20:33:59

Tags: python web-scraping

The table I want to scrape has a repeating structure:

<ul>
<li class="cell036 tal arrow"><a href=" y/">ALdCTL</a></li>
<li class="cell009">5,71</li>
<li class="cell009">5,74</li>   
<li class="cell009">-3,04</li>   
<li class="cell009">5,92</li>   
<li class="cell009">5,76</li>   
<li class="cell009">5,53</li>   
<li class="cell009">907.438</li>   
<li class="cell009">5.114.192</li> 
</ul>

My Python code can find the text in the first li element (through its a tag), but not the text in the li elements with class cell009:

c=soup.findAll('li',class_='cell036 tal arrow' )

for foo in soup.find_all('li', class_= ['cell036 tal arrow']):

   bar = foo.find(['a'])
   print(bar.text)

3 Answers:

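Answer 0 (score: 1)

To scrape all the values, you just need to get all the li tags. Your search is restricted to elements whose class is cell036 tal arrow, which is why you only get that one value.

Try this: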

from bs4 import BeautifulSoup

html_text = """
<ul>
<li class="cell036 tal arrow"><a href=" y/">ALdCTL</a></li>
<li class="cell009">5,71</li>
<li class="cell009">5,74</li>
<li class="cell009">-3,04</li>
<li class="cell009">5,92</li>
<li class="cell009">5,76</li>
<li class="cell009">5,53</li>
<li class="cell009">907.438</li>
<li class="cell009">5.114.192</li>
</ul>
"""

soup = BeautifulSoup(html_text, "lxml")

for foo in soup.find_all('li'):

   print(foo.text)

Output:

ALdCTL
5,71
5,74
-3,04
5,92
5,76
5,53
907.438
5.114.192
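If you would rather keep filtering by class instead of taking every li, here is a minimal sketch of one way to do it. It assumes the page only uses the classes shown in the question, and it shortens the sample HTML for brevity. A string with spaces passed to class_ is compared against the whole class attribute, while a list of single class names matches any element carrying one of them.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (assumption: same class names)
html_text = """
<ul>
<li class="cell036 tal arrow"><a href=" y/">ALdCTL</a></li>
<li class="cell009">5,71</li>
<li class="cell009">-3,04</li>
<li class="cell009">5.114.192</li>
</ul>
"""

soup = BeautifulSoup(html_text, "html.parser")

# The filter from the question: matches only the exact class string
only_header = soup.find_all("li", class_=["cell036 tal arrow"])

# A list of single class names matches an li that has any one of them
both = soup.find_all("li", class_=["cell036", "cell009"])
texts = [li.get_text(strip=True) for li in both]

print(len(only_header))
print(texts)
```

Using get_text(strip=True) also drops the trailing whitespace that appears after some of the li tags in the original markup.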

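Answer 1 (score: 1)

Borrowing drec4s's opening structure, you could also use a CSS selector list (a comma acts as OR) to target the li elements by class name.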

from bs4 import BeautifulSoup

html_text = """
<ul>
<li class="cell036 tal arrow"><a href=" y/">ALdCTL</a></li>
<li class="cell009">5,71</li>
<li class="cell009">5,74</li>
<li class="cell009">-3,04</li>
<li class="cell009">5,92</li>
<li class="cell009">5,76</li>
<li class="cell009">5,53</li>
<li class="cell009">907.438</li>
<li class="cell009">5.114.192</li>
</ul>
"""

soup = BeautifulSoup(html_text, "lxml")

for foo in soup.select('li.cell036.tal.arrow,li.cell009'):

   print(foo.text)
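A variation on the same idea, as a rough sketch: if you want the name and the numeric cells kept apart instead of printed in one stream, you can run two separate selectors. The variable names here are illustrative, and the sample HTML is shortened.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question
html_text = """
<ul>
<li class="cell036 tal arrow"><a href=" y/">ALdCTL</a></li>
<li class="cell009">5,71</li>
<li class="cell009">5,74</li>
<li class="cell009">907.438</li>
</ul>
"""

soup = BeautifulSoup(html_text, "html.parser")

# select_one returns the first match: the link inside the header cell
name = soup.select_one("li.cell036 a").get_text(strip=True)

# select returns every match: all of the value cells
values = [li.get_text(strip=True) for li in soup.select("li.cell009")]

print(name, values)
```

This works for a single ul. With several rows on the page, the two selectors would need to be applied per ul so that names and values stay paired.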

Answer 2 (score: 0)

The li you are looking up does not contain the other li elements; they are its siblings. Use find_next_siblings:

import bs4

content = """
<ul>
<li class="cell036 tal arrow"><a href=" y/">ALdCTL</a></li>
<li class="cell009">5,71</li>
<li class="cell009">5,74</li>   
<li class="cell009">-3,04</li>   
<li class="cell009">5,92</li>   
<li class="cell009">5,76</li>   
<li class="cell009">5,53</li>   
<li class="cell009">907.438</li>   
<li class="cell009">5.114.192</li> 
</ul>
"""

# Name the parser explicitly to avoid the "no parser specified" warning
soup = bs4.BeautifulSoup(content, "html.parser")
header = soup.findAll("li", class_="cell036 tal arrow")

header[0].find_next_siblings("li")

Gives:

[<li class="cell009">5,71</li>,
 <li class="cell009">5,74</li>,
 <li class="cell009">-3,04</li>,
 <li class="cell009">5,92</li>,
 <li class="cell009">5,76</li>,
 <li class="cell009">5,53</li>,
 <li class="cell009">907.438</li>,
 <li class="cell009">5.114.192</li>]
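Building on the sibling approach, the following sketch collects each header together with its values in a dict. The parse_number helper is hypothetical, and it assumes the cells use a comma as the decimal separator and a dot as the thousands separator, which the sample data suggests but the question does not state.

```python
import bs4

# Shortened copy of the HTML from the question
content = """
<ul>
<li class="cell036 tal arrow"><a href=" y/">ALdCTL</a></li>
<li class="cell009">5,71</li>
<li class="cell009">-3,04</li>
<li class="cell009">907.438</li>
<li class="cell009">5.114.192</li>
</ul>
"""


def parse_number(text):
    """Hypothetical helper: assumes '.' is a thousands separator and ',' a decimal mark."""
    return float(text.strip().replace(".", "").replace(",", "."))


soup = bs4.BeautifulSoup(content, "html.parser")

rows = {}
for header in soup.find_all("li", class_="cell036"):
    name = header.get_text(strip=True)
    # Siblings share the same parent, so only the cells of this ul are returned
    rows[name] = [parse_number(li.get_text()) for li in header.find_next_siblings("li")]

print(rows)
```

Because find_next_siblings only looks within the same parent, each header is paired with the cells of its own ul even when the page contains several such lists.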