Remove tags from a div

Date: 2015-03-19 21:31:58

Tags: python beautifulsoup

I have some simple Python code:

from bs4 import BeautifulSoup
import urllib2

webpage = urllib2.urlopen('http://fakepage.html')
soup = BeautifulSoup(webpage)

for anchor in soup.find_all("div", id="description"):
    print anchor

This gives me almost what I want, but the output between <div id=description> and </div> contains a lot of tags:

<div id="description"><div class="t"><p>some text to show <br><br> lots of <b> useless</b> tags </br></br></p></div></div>

I want to get only the text (not the tags) between <div id=description> and </div> so that I can count the words. Does BeautifulSoup have a function that can help me?

1 answer:

Answer 0 (score: 2)

Use the element.get_text() method to get only the text:

for anchor in soup.find_all("div", id="description"):
    print anchor.get_text()

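You can pass strip=True to strip the surrounding whitespace from each piece of text; the first argument is the separator used to join the stripped strings: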

for anchor in soup.find_all("div", id="description"):
    print anchor.get_text(' ', strip=True)

Demo:

>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <div id="description"><div class="t"><p>some text to show <br><br> lots of <b> useless</b> tags </br></br></p></div></div>
... '''
>>> soup = BeautifulSoup(sample)
>>> for anchor in soup.find_all("div", id="description"):
...     print anchor.get_text()
... 
some text to show  lots of  useless tags 
>>> for anchor in soup.find_all("div", id="description"):
...     print anchor.get_text(' ', strip=True)
... 
some text to show lots of useless tags
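
Since the goal in the question is to count words, the get_text() call above can be combined with str.split(). The following is only a minimal sketch, not part of the original answer: it is written in Python 3 syntax (unlike the Python 2 snippets above), the function name count_description_words is hypothetical, and it assumes the built-in "html.parser" backend and that splitting on whitespace is an acceptable definition of a word.

```python
from bs4 import BeautifulSoup


def count_description_words(html):
    """Return the number of whitespace-separated words inside every
    <div id="description"> element of the given HTML string.

    Hypothetical helper for illustration; "a word" here simply means
    a run of non-whitespace characters.
    """
    # Assumption: the stdlib-backed "html.parser" is good enough for this markup.
    soup = BeautifulSoup(html, "html.parser")
    total = 0
    for div in soup.find_all("div", id="description"):
        # Join the stripped text fragments with a single space so that
        # words on either side of a removed tag do not run together.
        text = div.get_text(" ", strip=True)
        total += len(text.split())
    return total


sample = (
    '<div id="description"><div class="t"><p>some text to show <br><br> '
    'lots of <b> useless</b> tags </br></br></p></div></div>'
)
print(count_description_words(sample))
```

Passing a separator matters here: with an empty separator and strip=True, fragments such as "show" and "lots" would be joined into one token and the count would come out too low.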