Extracting text between HTML tags

Time: 2018-02-02 19:35:17

Tags: python python-2.7

I have this string:

In December 2011, Norway's largest online sex shop hemmelig.com was <a href="http://www.dazzlepod.com/hemmelig/?page=93" target="_blank" rel="noopener">hacked by a collective calling themselves &quot;Team Appunity&quot;</a>. The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.

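(Don't ask.)

The string contains an href link back to the site itself. What I need to do is keep the text that sits between the <a href=""></a> tags while dropping the tags themselves. So the final result should look like this: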

In December 2011, Norway's largest online sex shop hemmelig.com was hacked by a collective calling themselves &quot;Team Appunity&quot;. The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.

What I have managed so far is to use a regular expression to match the whole tag and replace it with an empty string:

import re

def get_unlinked_description(descrip):
    html_tag_regex = re.compile(r"<.+>", re.I)
    return html_tag_regex.sub("", descrip)

However, as you might expect, this removes everything from the first < to the last >, including the link text:

In December 2011, Norway's largest online sex shop hemmelig.com was . The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes

How can I remove the tags while keeping the text between them, instead of deleting the whole span?

1 answer:

Answer 0 (score: 0)

You are probably looking for Beautiful Soup.

As for your implementation, the code would be:

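from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

soup.get_text()

html_doc is the string or document you want to parse, and 'html.parser' names the parser Beautiful Soup should use (here, the one built into Python's standard library).

This should return the description with the tags removed. Note that the parser decodes character entities, so &quot; comes back as a literal double quote: In December 2011, Norway's largest online sex shop hemmelig.com was hacked by a collective calling themselves "Team Appunity". The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.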
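
If you would rather stay with a regular expression, the pattern from the question can be adjusted. The `.+` in `<.+>` is greedy, so it matches from the first `<` to the last `>` and swallows the link text. Using `[^>]+` instead stops at the first `>`, so each tag is matched on its own. What follows is only a minimal sketch, assuming the input is a short fragment with simple tags (no `>` inside attribute values, no comments or scripts). The `sample` string is a shortened version of the one in the question.

```python
import re

# Matches a single tag: "<", then any characters except ">", then ">".
# Assumption: attribute values never contain a literal ">".
TAG_RE = re.compile(r"<[^>]+>")


def get_unlinked_description(descrip):
    """Remove the tags but keep the text between them."""
    return TAG_RE.sub("", descrip)


# Shortened version of the string from the question.
sample = (
    'hemmelig.com was <a href="http://www.dazzlepod.com/hemmelig/?page=93" '
    'target="_blank" rel="noopener">hacked by &quot;Team Appunity&quot;</a>. '
    'The attack exposed over 28,000 usernames.'
)

# The greedy pattern from the question removes the link text as well.
greedy_result = re.sub(r"<.+>", "", sample)

# The adjusted pattern removes only the opening and closing tags.
fixed_result = get_unlinked_description(sample)

print(greedy_result)
print(fixed_result)
```

This version leaves entities such as &quot; untouched, which matches the desired output in the question. For arbitrary HTML, a real parser is the safer choice.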