BeautifulSoup: looping through elements to get their text

Asked: 2015-07-19 02:25:14

Tags: python web-scraping beautifulsoup html-parsing

I am learning BeautifulSoup, and the page I am working with looks something like this:

HTML:

<div>
 <table>
 <tr>
  <td>
   <div>
     <a name='abc'>....</a>
   </div>
  </td>
 </tr>
</table>
</div>
<a name='pqr'>...</a> 
<div>text1</div>
<div>text2</div>
<div>text3</div>
 <a name='mno'>...</a> 

<div>
 <table>
 <tr>
  <td>
   <div>
     <a name='xyz'>....</a>
   </div>
  </td>
 </tr>
</table>
</div>

Expected result:

<a name='pqr'>...</a> 
<div>text1</div>
<div>text2</div>
<div>text3</div> 
<a name='mno'>...</a>

In other words, I want to get everything up to the point where the <a name='xyz'> tag is reached.

2 answers:

Answer 0 (score: 0)

You can make a function that matches every div element that has the pqr link as a previous sibling and the mno link as a next sibling:

def desired_divs(elm):
    if elm and elm.name == "div" and \
            elm.find_previous_sibling("a", {"name": "pqr"}) and \
            elm.find_next_sibling("a", {"name": "mno"}):
        return elm

for div in soup.find_all(desired_divs):
    print(div.text)

Prints:

text1
text2
text3
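For anyone who wants to run the filter-function approach end to end, here is a minimal self-contained sketch. The HTML string is a condensed copy of the markup from the question, the built-in "html.parser" backend is an assumption (any parser should behave the same on this input), and the name between_pqr_and_mno is made up for illustration.

```python
from bs4 import BeautifulSoup

# Condensed copy of the markup from the question.
HTML = """
<div><table><tr><td><div><a name='abc'>....</a></div></td></tr></table></div>
<a name='pqr'>...</a>
<div>text1</div>
<div>text2</div>
<div>text3</div>
<a name='mno'>...</a>
<div><table><tr><td><div><a name='xyz'>....</a></div></td></tr></table></div>
"""


def between_pqr_and_mno(tag):
    """Match div tags that have the pqr anchor before and the mno anchor after them as siblings."""
    return (
        tag.name == "div"
        and tag.find_previous_sibling("a", {"name": "pqr"}) is not None
        and tag.find_next_sibling("a", {"name": "mno"}) is not None
    )


# "html.parser" is assumed here; it ships with Python.
soup = BeautifulSoup(HTML, "html.parser")
texts = [div.get_text() for div in soup.find_all(between_pqr_and_mno)]
print(texts)
```

The final div is skipped because mno comes before it, not after it, and the nested divs are skipped because neither anchor is their sibling.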

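Alternatively, you can find the starting a element, then iterate over all of its following siblings, collecting the div texts along the way and stopping when you hit the ending a element: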

beginning = soup.find("a", {"name": "pqr"})
for elm in beginning.find_next_siblings():
    if elm.name == "a" and elm.get("name") == "mno":
        break

    print(elm.text)
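The same sibling walk can be wrapped in a small helper that returns the texts instead of printing them. This is only a sketch: texts_between is a hypothetical name, the HTML is a condensed copy of the question's markup, and the None check covers the case where the starting anchor is missing, which the snippet above does not handle.

```python
from bs4 import BeautifulSoup

# Condensed copy of the markup from the question.
HTML = """
<div><table><tr><td><div><a name='abc'>....</a></div></td></tr></table></div>
<a name='pqr'>...</a>
<div>text1</div>
<div>text2</div>
<div>text3</div>
<a name='mno'>...</a>
<div><table><tr><td><div><a name='xyz'>....</a></div></td></tr></table></div>
"""


def texts_between(soup, start_name, end_name):
    """Return the text of each sibling tag after <a name=start_name>, stopping at <a name=end_name>."""
    start = soup.find("a", {"name": start_name})
    if start is None:
        # No starting anchor: nothing to collect.
        return []

    collected = []
    for elm in start.find_next_siblings():
        if elm.name == "a" and elm.get("name") == end_name:
            break
        collected.append(elm.get_text())
    return collected


soup = BeautifulSoup(HTML, "html.parser")
print(texts_between(soup, "pqr", "mno"))
```

If the ending anchor never appears, the loop simply runs through all remaining siblings, so the caller may want to check for that separately.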

Answer 1 (score: 0)

I tried this and it works:

aref = soup.find('a', {"name": "abc"})

for i in aref.find_all_next():
    if i.attrs == {'name': 'xyz'}:
        break
    print(i.text)
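One thing to keep in mind with this approach is that it visits every later tag in document order, nested ones included, so the container tags around xyz are reached (and their text printed) before the loop stops. If the goal is the exact block shown under "Expected result", a sibling-based walk may be closer. The sketch below is one possible way, not a definitive one: it assumes the block starts at the pqr anchor, stops at the first sibling that is or contains the xyz anchor, and uses a condensed copy of the question's HTML.

```python
from bs4 import BeautifulSoup

# Condensed copy of the markup from the question.
HTML = """
<div><table><tr><td><div><a name='abc'>....</a></div></td></tr></table></div>
<a name='pqr'>...</a>
<div>text1</div>
<div>text2</div>
<div>text3</div>
<a name='mno'>...</a>
<div><table><tr><td><div><a name='xyz'>....</a></div></td></tr></table></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Assumption: the wanted block begins at the pqr anchor.
start = soup.find("a", {"name": "pqr"})
result = [start]

for sib in start.find_next_siblings():
    is_stop_anchor = sib.name == "a" and sib.get("name") == "xyz"
    contains_stop_anchor = sib.find("a", {"name": "xyz"}) is not None
    if is_stop_anchor or contains_stop_anchor:
        break
    result.append(sib)

# Each item is a Tag, so printing it gives the element's markup.
for tag in result:
    print(tag)
```

Because result holds Tag objects rather than strings, the same list can be used either to print the markup or to pull out text with get_text().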