Select all tags between two other tags with BeautifulSoup (and extract the text)

Date: 2018-11-11 18:53:10

Tags: python html beautifulsoup tags

I want to extract all instances of a given tag that are contained between two other tags. I am currently working with BeautifulSoup. Here is an example:

<p class='x' id = '1'> some content 1 </p>
<p class='y' id = 'a'> some content a </p>
<p class='y' id = 'b'> some content b </p>
<p class='y' id = 'c'> some content c </p>
<p class='potentially some other class'> </p>
<p class='x' id = '2'> some content 2 </p>
<p class='y' id = 'd'> some content d </p>
<p class='y' id = 'e'> some content e </p>
<p class='y' id = 'f'> some content f </p>

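I am interested in selecting all instances of class 'y' that lie between two tags of class 'x', which also have different ids. In this specific example, I want to select every p with class='y' that follows the 'x' tag with id='1' and precedes the 'x' tag with id='2', and retrieve its text. The output I ultimately want is: "some content a", "some content b" and "some content c".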

I tried the findAllNext method, but it gives me "some content a", "some content b", "some content c" and also "some content d", "some content e", "some content f".

Here is my code:

from bs4 import BeautifulSoup

# HTML_CODE holds the markup shown above
par = BeautifulSoup(HTML_CODE, 'lxml')
loc = par.find('p', class_='x', id='1')
desired = loc.findAllNext('p', class_='y')

Is there a way to avoid also selecting the class='y' instances that appear after the class='x' tag with id='2'?

Thanks.

1 Answer:

Answer 0 (score: 2)

You can iterate starting from the desired position and stop as soon as the end tag is reached.

from bs4 import BeautifulSoup

html = """

<p class='x' id = '1'> some content 1 </p>
<p class='y' id = 'a'> some content a </p>
<p class='y' id = 'b'> some content b </p>
<p class='y' id = 'c'> some content c </p>
<p class='potentially some other class1'> potentially some other class 1 </p>
<p class='potentially some other class2'> potentially some other class 2</p>
<p class='potentially some other class3'> potentially some other class 3 </p>
<p class='x' id = '2'> some content 2 </p>
<p class='y' id = 'd'> some content d </p>
<p class='y' id = 'e'> some content e </p>
<p class='y' id = 'f'> some content f </p>
"""

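soup = BeautifulSoup(html, "lxml")
# Iteration starts after the tag with id="c" and stops at the class "x" tag with id="2".
start = soup.find("p", class_="y", id="c")
end = soup.find("p", class_="x", id="2")

def next_ele(ele, result=None):
    # Create a fresh list per call; a mutable default list would be shared across calls.
    if result is None:
        result = []
    row = ele.find_next("p")
    # Stop at the end of the document or when the end tag is reached.
    if not row or row == end:
        return result
    result.append(row)
    return next_ele(row, result)

# Prints the <p> tags found between start and end.
print(next_ele(start))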
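
The same idea can also be written as a loop that starts at the class 'x' tag with id='1', stops at the class 'x' tag with id='2', and keeps only the text of the class 'y' paragraphs, which is the output the question asks for. This is only a rough sketch, not a tested recipe for every document: the helper name `texts_between`, its `wanted_class` parameter, the shortened sample markup, and the choice of the built-in "html.parser" (instead of "lxml") are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption, modeled on the example in the question)
html = """
<p class='x' id='1'> some content 1 </p>
<p class='y' id='a'> some content a </p>
<p class='y' id='b'> some content b </p>
<p class='y' id='c'> some content c </p>
<p class='other'> something else </p>
<p class='x' id='2'> some content 2 </p>
<p class='y' id='d'> some content d </p>
"""

def texts_between(soup, start_id, end_id, wanted_class="y"):
    """Hypothetical helper: text of <p class=wanted_class> tags that come
    after the class 'x' tag with start_id and before the one with end_id."""
    start = soup.find("p", class_="x", id=start_id)
    end = soup.find("p", class_="x", id=end_id)
    if start is None:
        return []
    texts = []
    # find_all_next walks every later <p> in document order
    for tag in start.find_all_next("p"):
        if tag is end:
            break  # reached the closing marker, stop collecting
        if wanted_class in tag.get("class", []):
            texts.append(tag.get_text(strip=True))
    return texts

soup = BeautifulSoup(html, "html.parser")
result = texts_between(soup, "1", "2")
print(result)
```

Compared with the recursive version, a loop avoids hitting Python's recursion limit on long documents. If no tag matches `end_id`, this sketch collects until the end of the document.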