Question

I have some HTML tables inside HTML cells, like so:

miniTable = ('<table style="width: 100%%" bgcolor="%s">'
             '<tr><td><font color="%s"><b>%s</b></td></tr>'
             '</table>') % (bgcolor, fontColor, floatNumber)

html += '<td>' + miniTable + '</td>'

Is there a way to remove the HTML tags that belong to this miniTable, and only those tags?
I would like to somehow remove these tags:

<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>
and
</b></td></tr></table>

to get this:

floatNumber

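where floatNumber is the string representation of a floating-point number. I don't want any other HTML tags to be modified in any way. I was thinking of using string.replace or a regex, but I'm stumped.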

Answer 1

Do not use str.replace or regex.

Use an HTML parsing library such as Beautiful Soup to get the elements you want and the text they contain.

The final code should look something like this:

from bs4 import BeautifulSoup

# html_doc is the HTML string you built; naming the parser explicitly avoids a warning
soup = BeautifulSoup(html_doc, "html.parser")

for t in soup.find_all("table"):  # the actual selection depends on your specific code
    content = t.get_text()
    # content should be the float number

Answer 2

If you cannot install and use Beautiful Soup (otherwise BS is preferred, as @otto-allmendinger suggested):

import re
s = '<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>1.23</b></td></tr></table>'
# "/?" matches an optional closing slash; the original ".?" would accept any character there
result = float(re.sub(r"</?table[^>]*>|</?t[rd]>|<font[^>]+>|</?b>", "", s))
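Note that the pattern above strips every table, tr and td tag in the string it is given, so it is only safe to run on the miniTable fragment itself. If the substitution has to run over the whole html string, one possible approach is to match the complete miniTable template and keep only the captured number. This is a rough sketch, not a general HTML solution: it assumes the miniTable markup is exactly what the question's template produces (after % formatting, "100%%" becomes "100%"), and the names MINI_TABLE and strip_mini_tables are made up for illustration.

```python
import re

# Hypothetical pattern mirroring the question's miniTable template.
# The only capture group is the text between <b> and </b>.
MINI_TABLE = re.compile(
    r'<table style="width: 100%" bgcolor="[^"]*">\s*'
    r'<tr><td><font color="[^"]*"><b>([^<]*)</b></td></tr>\s*'
    r'</table>'
)

def strip_mini_tables(html):
    """Replace each miniTable with the text it wraps; leave other markup alone."""
    return MINI_TABLE.sub(r"\1", html)

# Example: an outer row whose cell contains one miniTable
mini = ('<table style="width: 100%%" bgcolor="%s">'
        '<tr><td><font color="%s"><b>%s</b></td></tr>'
        '</table>') % ("#ffffff", "#000000", "1.23")
html = '<tr><td>' + mini + '</td><td><i>other</i></td></tr>'

cleaned = strip_mini_tables(html)
print(cleaned)
```

The outer td and the unrelated i tag survive because the pattern only matches the full miniTable shape. If the template ever changes, the pattern has to change with it, which is the main reason a parser is still the more robust choice.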
