beautifulsoup: get the <p> tags after an <h3>, including the tags inside those <p> tags

Time: 2017-03-15 12:03:53

Tags: beautifulsoup

I have the following raw HTML:

<h3>Job Description</h3>
<p>We are recruiting part time or full time cashier, to be based at our restaurant at Fraser Place, Jalan Perak.</p>
<p>Work Day: Monday to Friday<br> Work Hour: 7am-5pm (Full time), 9am-2pm (Part Time) or 10am-2pm (Part Time)</p>
<p>Full time rate at RM1600-RM1800 per month depends on experience, part time rate RM7-RM8/ hour depends on experience.</p>
<hr>
<h3>Working Location </h3>

I am trying to get all the text under "Job Description", up to but not including the <hr> tag.

I tried:

for header in soup.find_all('h3'):
    para = header.find_next_sibling('p')

But this only gets the first <p> after "Job Description", and it does not handle the <p> tag that contains a <br> tag.

1 answer:

Answer 0 (score: 0):

You can iterate over the siblings of the header until you reach the hr tag.

Example:

from bs4 import BeautifulSoup

example = """<h3>Job Description</h3>
<p>We are recruiting part time or full time cashier, to be based at our 
restaurant at Fraser Place, Jalan Perak.</p>
<p>Work Day: Monday to Friday<br> Work Hour: 7am-5pm (Full time), 9am-2pm 
(Part Time) or 10am-2pm (Part Time)</p>
<p>Full time rate at RM1600-RM1800 per month depends on experience, part time 
rate RM7-RM8/ hour depends on experience.</p>
<hr>
<h3>Working Location </h3>"""

soup = BeautifulSoup(example, 'html.parser')
for header in soup.find_all('h3'):
    nextNode = header
    while True:
        # Walk forward through the header's siblings, one node at a time.
        nextNode = nextNode.next_sibling
        if nextNode is None:
            break  # no more siblings
        # Text nodes (such as the newlines between tags) have name None; skip them.
        if nextNode.name is not None:
            if nextNode.name == "hr":
                break  # stop at the <hr> separator
            # The " " separator keeps the text on either side of <br> apart.
            print(nextNode.get_text(" ", strip=True))

Output:

We are recruiting part time or full time cashier, to be based at our 
restaurant at Fraser Place, Jalan Perak.
Work Day: Monday to Friday Work Hour: 7am-5pm (Full time), 9am-2pm 
(Part Time) or 10am-2pm (Part Time)
Full time rate at RM1600-RM1800 per month depends on experience, part time 
rate RM7-RM8/ hour depends on experience.
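
The same idea can be wrapped in a small helper that returns the text as a list instead of printing it. The following is only a sketch: the function name text_until and its stop_name parameter are hypothetical, and the HTML is an abbreviated version of the sample from the question. It uses the next_siblings generator and an isinstance check against Tag to skip the whitespace strings between tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def text_until(start, stop_name="hr"):
    """Return the text of each tag sibling after `start`,
    stopping at the first sibling tag named `stop_name`.
    (Hypothetical helper, not part of Beautiful Soup.)"""
    texts = []
    for node in start.next_siblings:
        if not isinstance(node, Tag):
            continue  # skip text nodes such as newlines between tags
        if node.name == stop_name:
            break  # reached the separator, stop collecting
        # Join the strings inside the tag with a space so that
        # text on either side of a <br> does not run together.
        texts.append(node.get_text(" ", strip=True))
    return texts


# Abbreviated version of the HTML from the question.
html = (
    "<h3>Job Description</h3>"
    "<p>Work Day: Monday to Friday<br> Work Hour: 7am-5pm</p>"
    "<p>Full time rate at RM1600-RM1800 per month.</p>"
    "<hr>"
    "<h3>Working Location </h3>"
)
soup = BeautifulSoup(html, "html.parser")
header = soup.find("h3", string="Job Description")
paragraphs = text_until(header)
print(paragraphs)
```

Returning a list leaves it to the caller to decide how to join or store the paragraphs, and selecting the header by its text avoids looping over every h3 in the page.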