How do I clean up this request result?

Date: 2019-05-14 14:14:59

Tags: python python-3.x string filter python-requests

I am working with a web API (a film API). When I send a POST request to one of its URLs using requests, I get the following response:

<a href='\"https:\/\/xdede.co\/peliculas\/p284052-ver-doctor-strange-online\"' up-target='\"body\"'>\n\t\t\t\t\t\t
<div class='\"SPoster\"'>\n\t\t\t\t\t\t\t
<img src='\"https:\/\/image.tmdb.org\/t\/p\/w45\/7OpmunCEZo93nyRIbx59QRaFvZz.jpg\"'/>\n\t\t\t\t\t\t&lt;\/div&gt;\n\t\t\t\t\t\t
<h2>Doctor Strange&lt;\/h2&gt;\n\t\t\t\t\t\t<span>Pelicula&lt;\/span&gt;\n\t\t\t\t\t&lt;\/a&gt;\n\t\t\t\t&lt;\/div&gt;\n\t\t\t\t"}</span>
</h2></div></a>

How can I filter this mess to get the href and the h2 tag? I have already tried BeautifulSoup, but without success. Any suggestions?

1 answer:

Answer 0: (score: 1)

Use BeautifulSoup together with regex:

import re

import bs4

html = """<a href='\"https:\/\/xdede.co\/peliculas\/p284052-ver-doctor-strange-online\"' up-target='\"body\"'>\n\t\t\t\t\t\t<div class='\"SPoster\"'>\n\t\t\t\t\t\t\t<img src='\"https:\/\/image.tmdb.org\/t\/p\/w45\/7OpmunCEZo93nyRIbx59QRaFvZz.jpg\"'/>\n\t\t\t\t\t\t&lt;\/div&gt;\n\t\t\t\t\t\t<h2>Doctor Strange&lt;\/h2&gt;\n\t\t\t\t\t\t<span>Pelicula&lt;\/span&gt;\n\t\t\t\t\t&lt;\/a&gt;\n\t\t\t\t&lt;\/div&gt;\n\t\t\t\t"}</span></h2></div></a>"""
soup = bs4.BeautifulSoup(html, features='html.parser')

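# Strip the leftover backslashes and double quotes from the href value.
href = re.sub(r'[\\"]', '', soup.a['href'])
# The closing tags arrive as escaped text, so remove tag-like fragments first,
# then keep only the runs of word characters.
h2 = re.sub(r'<[^>]*>', '', soup.a.h2.text)
h2 = ' '.join(re.findall(r'(\w+)', h2))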

print(href)
print(h2)

Output:

https://xdede.co/peliculas/p284052-ver-doctor-strange-online
Doctor Strange Pelicula
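
Note that "Pelicula" is included because the h2 is never properly closed in the escaped markup, so the span text ends up inside it. The escaped slashes and the trailing "} suggest the response body may actually be JSON with the HTML stored as a string value. If that is the case, decoding the JSON first removes the escapes, and the markup can then be parsed normally. Below is a minimal sketch under that assumption. The key name "html", the sample raw_body, and the helper names LinkTitleParser and extract_link_and_title are all hypothetical, and the standard-library HTMLParser is used here only to keep the example dependency-free (BeautifulSoup should work equally well on the decoded string).

```python
import json
from html.parser import HTMLParser

# Hypothetical raw response body: a JSON object whose "html" key (an assumed
# name) holds the escaped markup, similar in shape to the response above.
raw_body = r'{"html": "<a href=\"https:\/\/xdede.co\/peliculas\/p284052-ver-doctor-strange-online\" up-target=\"body\">\n\t<div class=\"SPoster\">\n\t\t<img src=\"https:\/\/image.tmdb.org\/t\/p\/w45\/7OpmunCEZo93nyRIbx59QRaFvZz.jpg\"\/>\n\t<\/div>\n\t<h2>Doctor Strange<\/h2>\n\t<span>Pelicula<\/span>\n<\/a>"}'


class LinkTitleParser(HTMLParser):
    """Collects the first <a> href and the first <h2> text it sees."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.title = None
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")
        elif tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and self.title is None:
            self.title = data.strip()


def extract_link_and_title(body, key="html"):
    # json.loads undoes the \" \/ \n \t escapes, leaving ordinary HTML.
    markup = json.loads(body)[key]
    parser = LinkTitleParser()
    parser.feed(markup)
    parser.close()
    return parser.href, parser.title


href, title = extract_link_and_title(raw_body)
print(href)
print(title)
```

With requests, calling response.json() on the response object performs the same JSON decoding step, provided the body really is valid JSON.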