Question

Consider the following HTML snippet:

html = '''
 <p>
  The chairman of European Union leaders, Donald Tusk, will meet May in London on Thursday, a day after the bloc’s Brexit negotiator weakened sterling by issuing another warning to Britain, which is due to leave the bloc in March 2019.
 </p>
'''

Turn it into a BeautifulSoup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

I want to transform this soup object so that its HTML output is:

'''
    <p>
      The chairman of European Union leaders, <span style="color : red"> Donald Tusk </span>, will meet May in London on Thursday, a day after the bloc’s Brexit negotiator weakened sterling by issuing another warning to Britain, which is due to leave the bloc in March 2019.
     </p>
'''

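On the BeautifulSoup documentation page I found several examples showing how to replace a string, create a new tag, or even insert a new tag at a specific position in the tree, but none that covers my use case: adding a new tag in the middle of a string.

Any help is very welcome.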

Answer 1

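First, thank you for posting this question; it is a very interesting coding problem.

I spent some time looking into it before deciding to post an answer.

I tried using insert_before() and insert_after() from BeautifulSoup to modify the <p> tag in your sample HTML. I also looked at append() and extend(). After dozens of attempts, I could not get the result you asked for.

The following code seems to perform the requested HTML modification based on a keyword (e.g. Donald Tusk). I used BeautifulSoup's replace_with() to replace the original tag in the HTML with a new one created by new_tag().

The code works, but I'm sure it can be improved.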

import re

from bs4 import BeautifulSoup

raw_html = """
<p> This is a test. </p>
<p>The chairman of European Union leaders, Donald Tusk, will meet May in London on Thursday, a day after the bloc’s Brexit negotiator weakened sterling by issuing another warning to Britain, which is due to leave the bloc in March 2019.</p>
<p> This is also a test. </p>
"""

soup = BeautifulSoup(raw_html, 'lxml')

# find the tag that contains the keyword Donald Tusk
# (string= is the current name of the older text= argument)
original_tag = soup.find('p', string=re.compile(r'Donald Tusk'))

if original_tag:
  # modify text in the tag that was found in the HTML
  tag_to_modify = str(original_tag.get_text()).replace('Donald Tusk,', '<span style="color:red">Donald Tusk</span>,')

  print(tag_to_modify)
  # outputs:
  # The chairman of European Union leaders, <span style="color:red">Donald Tusk</span>, will meet May in London on Thursday, a day after the bloc’s Brexit negotiator weakened sterling by issuing another warning to Britain, which is due to leave the bloc in March 2019.

  # create a new <p> tag in the soup
  new_tag = soup.new_tag('p')

  # add the modified text to the new tag
  # setting a tag’s .string attribute replaces the contents with the new string
  new_tag.string = tag_to_modify

  # replace the original tag with the new tag
  old_tag = original_tag.replace_with(new_tag)

  # formatter=None, BeautifulSoup will not modify strings on output
  # without this the angle brackets will get turned into “&lt;”, and “&gt;”
  print(soup.prettify(formatter=None))
  # outputs:
  # <html>
  #   <body>
  #     <p>
  #       This is a test.
  #     </p>
  #     <p>
  #       The chairman of European Union leaders, <span style="color:red">Donald Tusk</span>, will meet May in London on Thursday, a day after the bloc’s Brexit negotiator weakened sterling by issuing another warning to Britain, which is due to leave the bloc in March 2019.
  #     </p>
  #     <p>
  #       This is also a test.
  #     </p>
  #   </body>
  # </html>
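
One possible refinement, offered as a sketch rather than a definitive fix: the code above stores the markup as a plain string, so the span only appears when printing with formatter=None. Parsing the marked-up text into a fragment and moving its children into the original tag should make the span a real Tag in the tree. The shortened sample text and the names keyword, safe_text, marked_up and fragment are assumptions made for illustration.

```python
import re
from html import escape

from bs4 import BeautifulSoup

raw_html = (
    "<p>This is a test.</p>"
    "<p>The chairman of European Union leaders, Donald Tusk, "
    "will meet May in London on Thursday.</p>"
)
keyword = "Donald Tusk"  # illustrative keyword, taken from the question

soup = BeautifulSoup(raw_html, "html.parser")

# Find the <p> whose single string contains the keyword.
original_tag = soup.find("p", string=re.compile(re.escape(keyword)))

if original_tag is not None:
    # Escape the plain text first so a stray "<" or "&" is not parsed as markup.
    safe_text = escape(original_tag.get_text(), quote=False)
    safe_keyword = escape(keyword, quote=False)
    marked_up = safe_text.replace(
        safe_keyword,
        '<span style="color:red">{}</span>'.format(safe_keyword),
    )

    # Parse the marked-up text so the <span> becomes a real Tag.
    fragment = BeautifulSoup(marked_up, "html.parser")

    # Swap the old contents of the <p> for the parsed children.
    original_tag.clear()
    for child in list(fragment.contents):
        original_tag.append(child)

print(soup)
```

Because the span is now part of the tree, the default formatter can be used and soup.find('span') will locate it. Note that this sketch flattens any nested tags inside the matched paragraph, just as get_text() does in the code above.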

Answer 2

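Try using a loop: walk through each word in the string, find the string you are looking for (using whatever method works; a regular expression would be useful), then use Tag.insert(position, "found_word").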
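
A minimal sketch of how that idea might look in practice, assuming the keyword sits inside a single text node: split the text node around the keyword, then use Tag.insert() to put the leading text, a new span, and the trailing text back at the same position. The shortened sample text and the variable names are illustrative, and only the first occurrence in the first matching text node is handled.

```python
import re

from bs4 import BeautifulSoup, NavigableString

html_doc = (
    "<p>The chairman of European Union leaders, Donald Tusk, "
    "will meet May in London.</p>"
)
keyword = "Donald Tusk"  # illustrative keyword, taken from the question

soup = BeautifulSoup(html_doc, "html.parser")

# Locate the text node (a NavigableString) that contains the keyword.
text_node = soup.find(string=re.compile(re.escape(keyword)))

if text_node is not None:
    # Split the text into the parts before and after the keyword.
    before, _, after = text_node.partition(keyword)

    # Build the <span> that will wrap the keyword.
    span = soup.new_tag("span")
    span["style"] = "color: red"
    span.string = keyword

    # Remember where the text node sits, remove it, and insert the
    # three replacement pieces at the same position in the parent.
    parent = text_node.parent
    position = parent.index(text_node)
    text_node.extract()
    parent.insert(position, NavigableString(before))
    parent.insert(position + 1, span)
    parent.insert(position + 2, NavigableString(after))

print(soup)
```

Working on the text node rather than on the whole tag leaves any sibling tags inside the paragraph untouched.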

Answer 3

You need to use a regular expression. I hope this code helps.

import re

def highlight_matches(query, text):
    def span_matches(match):
        html = '<span style="color : red">{0}</span>'
        return html.format(match.group(0))
    return re.sub(query, span_matches, text, flags=re.I)
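
A short usage sketch, repeating the function so the example runs on its own. Since query is treated as a regular expression, passing a literal keyword through re.escape() is probably the safer choice; the sample sentence and the names sentence, pattern and result are assumptions for illustration.

```python
import re


def highlight_matches(query, text):
    def span_matches(match):
        html = '<span style="color : red">{0}</span>'
        return html.format(match.group(0))
    return re.sub(query, span_matches, text, flags=re.I)


sentence = "The chairman, Donald Tusk, will meet May in London."

# Escape the keyword so regex metacharacters in it are matched literally.
pattern = re.escape("donald tusk")

# The match is case-insensitive, and the original casing is kept
# because the replacement uses match.group(0).
result = highlight_matches(pattern, sentence)
print(result)
```

The result is a plain string, not part of a soup object, so it still has to be parsed (for example with BeautifulSoup(result, 'html.parser')) before the span exists as a tag in the tree.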

Replace a word in a string with a tag
