How to add space around removed tags in BeautifulSoup

Time: 2015-06-30 13:50:56

Tags: python html beautifulsoup html-parsing

from BeautifulSoup import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''


soup = BeautifulSoup(html)
all_poems = soup.findAll("div", {"class": "thisText"})
for poems in all_poems:
    print(poems.text)

I have this sample code and I can't find how to add spaces around the removed tags, so that the text inside the <a href...> tags stays readable when it is extracted and isn't displayed like this:

PoemThe RavenOnce upon a midnight dreary, while I pondered, weak and weary...

In the greenest of our valleys By good angels tenanted..., part ofThe Haunted Palace

3 Answers:

Answer 0 (score: 2)

One option is to find all the text nodes and join them with a space:

" ".join(item.strip() for item in poems.find_all(text=True))

Also, you are using BeautifulSoup 3, which is outdated and no longer maintained. Upgrade to BeautifulSoup 4:

pip install beautifulsoup4

and replace:

from BeautifulSoup import BeautifulSoup

with:

from bs4 import BeautifulSoup
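
To make the text-node approach concrete, here is a minimal sketch assuming BeautifulSoup 4 with the built-in "html.parser". The helper name spaced_text and the shortened HTML are made up for illustration. It uses string=True, the newer spelling of text=True in bs4, and it skips whitespace-only nodes so the join does not produce doubled spaces.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup (href replaced for brevity).
html = '''<div class="thisText">
Poem <a href="#">The Raven</a>Once upon a midnight dreary... </div>'''

soup = BeautifulSoup(html, "html.parser")


def spaced_text(tag):
    """Join all text nodes under `tag` with a single space (hypothetical helper)."""
    # Trim each text node, then drop the ones that were whitespace only.
    pieces = (node.strip() for node in tag.find_all(string=True))
    return " ".join(piece for piece in pieces if piece)


for poem in soup.find_all("div", {"class": "thisText"}):
    print(spaced_text(poem))  # Poem The Raven Once upon a midnight dreary...
```

Filtering out the empty pieces matters when the markup has newlines between tags; without it, each blank node would still contribute a separator.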

Answer 1 (score: 1)

get_text() in BeautifulSoup 4 takes an optional separator argument. You can use it as follows:

soup = BeautifulSoup(html)
text = soup.get_text(separator=' ')
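
As a rough sketch of the same idea applied per div, get_text() also accepts strip=True, which trims each piece of text and skips the empty ones before joining. The shortened HTML and the "html.parser" choice below are assumptions for the example, not part of the original question.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup (hrefs replaced for brevity).
html = '''<div class="thisText">
Poem <a href="#">The Raven</a>Once upon a midnight dreary... </div>
<div class="thisText">
part of<a href="#">The Haunted Palace</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# One cleaned-up string per div: text nodes are stripped, then joined with " ".
lines = [
    div.get_text(separator=" ", strip=True)
    for div in soup.find_all("div", class_="thisText")
]

for line in lines:
    print(line)
# Poem The Raven Once upon a midnight dreary...
# part of The Haunted Palace
```

One thing to keep in mind: the separator goes between every pair of text nodes, so an inline tag in the middle of a word (for example <b>w</b>ord) would come out as "w ord".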

Answer 2 (score: 0)

Here is an alternative that uses lxml and its xpath function to search for all the text nodes:

from lxml import etree

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

root = etree.fromstring(html, etree.HTMLParser())
print(' '.join(root.xpath("//text()")))

It produces:

Poem  The Raven Once upon a midnight dreary, while I pondered, weak and weary...  


In the greenest of our valleys By good angels tenanted..., part of The Haunted Palace
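
The output above still has doubled spaces and blank lines, because each text node keeps its own leading and trailing whitespace. If a single clean line is wanted, one possible follow-up (a sketch using only the standard library; the helper name collapse_whitespace and the sample string are made up for illustration) is to collapse runs of whitespace after joining:

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with one space."""
    # str.split() with no arguments splits on any whitespace run and
    # discards leading/trailing whitespace, so re-joining normalizes it.
    return " ".join(text.split())


# Sample string shaped like the joined xpath result (shortened).
raw = "Poem  The Raven Once upon a midnight dreary...  \n\n\nIn the greenest of our valleys"
print(collapse_whitespace(raw))
# Poem The Raven Once upon a midnight dreary... In the greenest of our valleys
```

This merges both poems into one line; to keep them separate, apply it to the text of each div individually.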