Question

My BeautifulSoup object contains a block of HTML that I want to replace with different HTML.

This is what I currently get in the BeautifulSoup object:

<html>
<body>
<table class="bt" width="100%">
<tr class="heading">
<th scope="col">Â </th>
<th class="th-heading" scope="col">B</th>
<th class="tho" scope="col"><b>O</b></th></tr></table></div></div></div></div></div></div></body></html></html>
<th class="thm" scope="col"><b>M</b></th>
<th class="thr" scope="col"><b>R</b></th>
<th class="thw" scope="col"><b>W</b></th>
<th class="thecon" scope="col"><b>E</b></th>
<th class="thw" scope="col"><b>0s</b></th>
<th class="thw" scope="col"><b>F</b></th>
<th class="thw" scope="col"><b>S</b></th>
<th scope="col">Â </th>.............</body></html>

Required output:

<html>
<body>
<table class="bt" width="100%">
<tr class="heading">
<th scope="col">Â </th>
<th class="th-heading" scope="col">B</th>
<th class="tho" scope="col"><b>O</b></th>
<th class="thm" scope="col"><b>M</b></th>
<th class="thr" scope="col"><b>R</b></th>
<th class="thw" scope="col"><b>W</b></th>
<th class="thecon" scope="col"><b>E</b></th>
<th class="thw" scope="col"><b>0s</b></th>
<th class="thw" scope="col"><b>F</b></th>
<th class="thw" scope="col"><b>S</b></th>
<th scope="col">Â </th>.............</body></html>

I have tried the following, but it does not work:

soup.replace('<th class="tho" scope="col"><b>O</b></th></tr></table></div></div></div></div></div></div></body></html></html>', '<th class="tho" scope="col"><b>O</b></th>')

Answer 1

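Your own attempt already hints at a string replacement rather than an actual insertion into the HTML tree. That is because the HTML you start from is malformed.

One solution is to append the missing tags to the original tree generated by BeautifulSoup: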

from bs4 import BeautifulSoup
import re

start_str = """<html><body><table class="bt" width="100%"><tr class="heading"><th scope="col">Â </th>
<th class="th-heading" scope="col">B</th>
<th class="tho" scope="col"><b>O</b></th></tr></table></div></div></div></div></div></div></body></html></html>
<th class="thm" scope="col"><b>M</b></th>
<th class="thr" scope="col"><b>R</b></th>
<th class="thw" scope="col"><b>W</b></th>
<th class="thecon" scope="col"><b>E</b></th>
<th class="thw" scope="col"><b>0s</b></th>
<th class="thw" scope="col"><b>F</b></th>
<th class="thw" scope="col"><b>S</b></th>
<th scope="col">Â </th>.............</body></html>"""
soup = BeautifulSoup(start_str) # remark: this'll split right after the first '</html>'
substr = re.findall('<th class="thm".*', start_str, re.DOTALL)
subsoup = BeautifulSoup(substr[0])
for tag in subsoup.find_all('th'):  # find_all is the current name for findAll
    soup.tr.append(tag)

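Although using regular expressions to parse HTML is not recommended, this is an edge case: it is not really parsing at all, just selecting a substring. In that sense, the regex can even be replaced entirely with a plain Python built-in: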

substr = start_str.split('</html></html>')[1]
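
The two ideas above can be combined so that the result does not depend on how a given parser treats the text after the first '</html>'. The following is a minimal sketch, not the exact code from this answer: it splits the raw string first, parses each half separately, and then moves the <th> tags into the row. The shortened markup, the variable names, and the explicit 'html.parser' argument are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the malformed markup in the question (assumed sample).
markup = (
    '<html><body><table class="bt"><tr class="heading">'
    '<th class="tho"><b>O</b></th></tr></table></div></body></html></html>\n'
    '<th class="thm"><b>M</b></th>\n'
    '<th class="thr"><b>R</b></th></body></html>'
)

# Split at the premature document end before any parsing happens.
head, tail = markup.split('</html></html>', 1)

soup = BeautifulSoup(head, 'html.parser')     # tree holding the table row
subsoup = BeautifulSoup(tail, 'html.parser')  # the orphaned <th> tags

# Appending a tag moves it out of subsoup and into the row.
for tag in subsoup.find_all('th'):
    soup.tr.append(tag)

headers = [th.get_text() for th in soup.tr.find_all('th')]
print(headers)
```

Parsing the two halves separately keeps the orphaned cells out of the first tree, so nothing is appended twice.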

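Another solution is simply to remove the unwanted tags from the string before parsing, but this only works if that substring is always exactly the same: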

to_remove = '</tr></table></div></div></div></div></div></div></body></html></html>'
soup = BeautifulSoup(''.join(start_str.split(to_remove)))

If there may be whitespace between these tags, you can use the re module in this solution as well.
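
A minimal sketch of that regex variant follows. The pattern, the name STRAY_CLOSERS, and the shortened sample markup are assumptions for illustration; the pattern would need adjusting if the stray closing tags differ from the ones shown in the question.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the malformed markup, with whitespace between
# the stray closing tags (assumed sample).
broken = (
    '<html><body><table class="bt"><tr class="heading">'
    '<th class="tho"><b>O</b></th></tr> </table>\n'
    '</div></div></body></html></html>\n'
    '<th class="thm"><b>M</b></th>\n'
    '</tr></table></body></html>'
)

# A run of premature closing tags, allowing whitespace between them,
# matched only when another <th> follows directly afterwards.
STRAY_CLOSERS = re.compile(
    r'</tr>\s*</table>\s*(?:</div>\s*)*</body>\s*(?:</html>\s*)+(?=<th\b)'
)

cleaned = STRAY_CLOSERS.sub('', broken, count=1)
soup = BeautifulSoup(cleaned, 'html.parser')

headers = [th.get_text() for th in soup.tr.find_all('th')]
print(headers)
```

The lookahead keeps the legitimate closing tags at the end of the document intact, because they are not followed by another <th>.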
