How to get and replace text between certain tags

Time: 2014-06-24 16:14:36

Tags: python html regex html-parsing

Given a string like

"<p> >this line starts with an arrow <br /> this line does not </p>"

or

"<p> >this line starts with an arrow </p> <p> this line does not </p>"

how can I find the lines that start with an arrow and wrap them in a div, so that it becomes:

"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"

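3 answers:

Answer 0 (score: 6)

Since it is HTML you are parsing, use the right tool for the job: an HTML parser such as BeautifulSoup.

Use find_all() to find all text nodes that start with > and wrap() each of them in a new div tag: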

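from bs4 import BeautifulSoup

data = "<p> >this line starts with an arrow <br /> this line does not </p>"

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")

# Wrap every text node whose stripped content starts with '>' in a new <div>.
for item in soup.find_all(string=lambda x: x.strip().startswith('>')):
    item.wrap(soup.new_tag('div'))

print(soup.prettify())

Prints: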

<p>
    <div>
    >this line starts with an arrow
    </div>
    <br/>
    this line does not
</p>

Answer 1 (score: 3)

You can try the regex pattern >\s+(>.*?)<.

import re
regex = re.compile(r">\s+(>.*?)<")  # raw string, same pattern as above
testString = "" # fill this in
matchArray = regex.findall(testString)
# the matchArray variable contains the list of matches

Then replace the matched group with <div> matched_group </div>. Here the pattern looks for anything enclosed between "> >" and "<".

Here is a demo on debuggex.
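
The replacement step described above is not shown in the answer, so here is a minimal sketch of one way it might look. The helper name wrap_arrow_lines and the two extra capture groups are assumptions added for illustration. They keep the surrounding ">" and "<" in place, because re.sub replaces the whole match and not only the captured group.

```python
import re

# Same pattern as in the answer, with the delimiters captured as well
# (groups 1 and 3 are an addition for this sketch) so they can be re-emitted.
ARROW_LINE = re.compile(r"(>\s+)(>.*?)(<)")

def wrap_arrow_lines(html):
    """Hypothetical helper: wrap each arrow-prefixed run of text in a <div>."""
    return ARROW_LINE.sub(r"\1<div> \2</div> \3", html)

if __name__ == "__main__":
    print(wrap_arrow_lines(
        "<p> >this line starts with an arrow <br /> this line does not </p>"
    ))
```

As with any regex over HTML, this only covers input shaped like the examples in the question: the arrow text has to follow a tag and whitespace, and must not contain a "<" of its own.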

Answer 2 (score: 1)

You could try this regex,

>(\w[^<]*)

DEMO

The Python code would be,

>>> import re
>>> s = '"<p> >this line starts with an arrow <br /> this line does not </p>"'
>>> m = re.sub(r'>(\w[^<]*)', r"<div> >\1</div> ", s)
>>> m
'"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"'
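
One thing worth checking before relying on this pattern: it keys on a ">" immediately followed by a word character, so it cannot tell an arrow in the text from the ">" that closes a tag. The sketch below compares the answer's pattern with a stricter variant. The STRICT pattern, its lookbehind, and the helper name wrap_arrows are assumptions made for illustration and are not part of the answer.

```python
import re

# The answer's pattern, and a stricter variant (an assumption, not from the
# answer) that only accepts a ">" preceded by a whitespace character.
LOOSE = re.compile(r">(\w[^<]*)")
STRICT = re.compile(r"(?<=\s)>(\w[^<]*)")

def wrap_arrows(pattern, html):
    """Hypothetical helper: wrap each match in a div, as the answer does."""
    return pattern.sub(r"<div> >\1</div> ", html)

plain = "<p>no arrow here</p>"
arrow = "<p> >this line starts with an arrow <br /> this line does not </p>"

if __name__ == "__main__":
    print(wrap_arrows(LOOSE, plain))   # the tag's own ">" is treated as an arrow
    print(wrap_arrows(STRICT, plain))  # text without an arrow is left unchanged
    print(wrap_arrows(STRICT, arrow))  # the arrow line is still wrapped
```

The stricter variant is still a heuristic. For arbitrary HTML, a parser-based approach like the one in Answer 0 is the safer choice.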