Best way to "clean" HTML text

Time: 2015-08-21 03:28:33

Tags: python

I have the following text:

"It's the show your only friend and pastor have been talking about! 
<i>Wonder Showzen</i> is a hilarious glimpse into the black 
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth, 
nature, diversity, and history &#8211; all inside the prison of 
your mind! Where else can you..."

What I want to do is remove the HTML tags and get the result as unicode. This is what I am doing now:

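import re
TAG_RE = re.compile(r'<[^>]+>')  # assumed pattern; the question does not show how TAG_RE is defined
def remove_tags(text):
    return TAG_RE.sub('', text)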

This only strips the tags. How do I correctly encode the text above for database storage?

1 answer:

Answer 0 (score: 2)

You can try passing the text through an HTML parser. Here is an example using BeautifulSoup:

from bs4 import BeautifulSoup

text = '''It's the show your only friend and pastor have been talking about! 
<i>Wonder Showzen</i> is a hilarious glimpse into the black 
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth, 
nature, diversity, and history &#8211; all inside the prison of 
your mind! Where else can you...'''

soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning

>>> soup.text
u"It's the show your only friend and pastor have been talking about! \nWonder Showzen is a hilarious glimpse into the black \nheart of childhood innocence! Get ready as the complete first season of MTV2's Wonder Showzen tackles valuable life lessons like birth, \nnature, diversity, and history \u2013 all inside the prison of \nyour mind! Where else can you..."

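You now have a unicode string in which the HTML entities have been converted to the corresponding Unicode characters; for example, &#8211; has become the en dash, which the repr shows as \u2013.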

This also removes the HTML tags.
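If installing BeautifulSoup is not an option, roughly the same cleanup can be sketched with the Python 3 standard library alone. This is only an illustration under stated assumptions, not part of the original answer: TextExtractor and clean_html are hypothetical names, and the in-memory SQLite table "shows" merely stands in for whatever database the question has in mind. It suggests that once the text is a decoded str, it can be handed to the driver as a bound parameter without any manual encoding step.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text content, drops tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#8211; before
        # the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(markup):
    """Return markup with tags removed and entities decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = "MTV2's<i> Wonder Showzen</i> tackles history &#8211; all inside"
cleaned = clean_html(sample)

# Assumed storage layer: an in-memory SQLite table named "shows".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (description TEXT)")
# Pass the str as a bound parameter; the driver handles the encoding.
conn.execute("INSERT INTO shows (description) VALUES (?)", (cleaned,))
stored = conn.execute("SELECT description FROM shows").fetchone()[0]
conn.close()

print(stored == cleaned)
```

The round trip returns the same string that was inserted, en dash included. A real HTML parser such as BeautifulSoup is still likely to cope better with badly broken markup than this minimal subclass.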