How do I parse the HTML heading hierarchy in Python?

Time: 2019-03-06 05:16:19

Tags: python html beautifulsoup html-parsing

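I have an HTML page from which I extract all the headings (h1-h7) using Beautiful Soup. Now I want a list in which each heading is prefixed with the text of all the higher-level headings directly above it.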

For example, I have the following sample HTML page:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
<h1>dummy h1</h1>
<h1>head 1</h1>
<p>para 1</p>
<h2>head 2</h2>
<p>para 2</p>
<h3>head 3</h3>
<p>p for head3</p>
<h2>head2(2)</h2>
<p>para3</p>
<h1>head1(2)</h1>
<h2>2nd h2</h2>
<h3>2nd h3</h3>
<p>2nd p for h3</p>
</body>
</html>

The list I want is:

['head1','head1 head2','head1 head2 head3','head1 head2(2)','head1(2)','head1(2) 2nd h2','head1(2) 2nd h2 2nd h3']

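My logic is to walk backwards from the current h tag and break out of the loop as soon as I meet an h tag with a smaller number. This causes a problem: going backwards from head2(2), the loop breaks at head3, when ideally it should continue up to head1. Here is the code I tried: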

from copy import deepcopy

from bs4 import BeautifulSoup

with open("sample.html", "r") as file:
    page = file.read()
soup = BeautifulSoup(page, 'html.parser')
tags=['h1','h2','h3','h4','h5','h6','h7']
start=soup.find('h1') # the page I am working on starts with a dummy

head=[]
h=[]
h3=[]     

for ele in start.next_siblings:
    for i,tag in enumerate(tags):
        if (ele.name==tag):
            head.append('')
            h.append(ele)
            h3=deepcopy(h)
            h3.reverse()
            for j, q in enumerate(h3):
                if q.name in tags[:i]:
                    head[len(head)-1]=(q.text.strip()) + ' ' + head[len(head)-1]

                if j < len(h)-1 and (tags.index(q.name) == tags.index(h3[j+1].name)):
                    continue

                if j < len(h)-1 and (tags.index(q.name) < tags.index(h3[j+1].name)):
                    break

            head[len(head)-1]+=(ele.text.strip())+' '
            break
print(head)

Please suggest how to avoid this problem.

1 answer:

Answer 0: (score: 0)

I found what is wrong with your algorithm. You just need to add a test on q.name to the break condition:

if j < len(h)-1 and (tags.index(q.name) < tags.index(h3[j+1].name)) and q.name == 'h1':
    break

So the complete code would be:

from copy import deepcopy

from bs4 import BeautifulSoup

with open("sample.html", "r") as file:
    page = file.read()
soup = BeautifulSoup(page, 'html.parser')
tags=['h1','h2','h3','h4','h5','h6','h7']
start=soup.find('h1') # the page I am working on starts with a dummy

head=[]
h=[]
h3=[]

for ele in start.next_siblings:
    for i,tag in enumerate(tags):
        if (ele.name==tag):
            head.append('')
            h.append(ele)
            h3=deepcopy(h)
            h3.reverse()
            for j, q in enumerate(h3):

                if q.name in tags[:i]:
                    head[len(head)-1]=(q.text.strip()) + ' ' + head[len(head)-1]

                if j < len(h)-1 and (tags.index(q.name) == tags.index(h3[j+1].name)):
                    continue

                if j < len(h)-1 and (tags.index(q.name) < tags.index(h3[j+1].name)) and q.name == 'h1':
                    break

            head[len(head)-1]+=(ele.text.strip())+' '
            break
print(head)

Output:

['head 1 ', 'head 1 head 2 ', 'head 1 head 2 head 3 ', 'head 1 head2(2) ', 'head1(2) ', 'head1(2) 2nd h2 ', 'head1(2) 2nd h2 2nd h3 ']
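
The extra q.name == 'h1' test gives the right result for this sample, but the backward scan still prepends every earlier higher-level heading it passes until it reaches an h1. On a page with an h1, then two sibling h2 headings, then an h3, the h3 entry would pick up both h2 texts. One possible alternative, offered as a sketch rather than a drop-in replacement, is to keep a stack of the currently open headings and pop everything at the same or a deeper level before pushing the new one. The names heading_paths and sample below are made up for illustration, and the input is assumed to be (level, text) pairs in document order with the dummy h1 already skipped.

```python
def heading_paths(headings):
    """Return one 'ancestors ... own text' string per heading.

    `headings` is an iterable of (level, text) pairs in document order,
    e.g. (1, 'head 1') for <h1>head 1</h1>.
    """
    stack = []  # currently open headings, outermost first
    paths = []
    for level, text in headings:
        # A new heading closes every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, text))
        paths.append(' '.join(t for _, t in stack))
    return paths


# (level, text) pairs for the sample page, dummy h1 already skipped.
sample = [
    (1, 'head 1'), (2, 'head 2'), (3, 'head 3'), (2, 'head2(2)'),
    (1, 'head1(2)'), (2, '2nd h2'), (3, '2nd h3'),
]
print(heading_paths(sample))
```

With Beautiful Soup, the pairs could be built from soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']), taking int(tag.name[1]) as the level and tag.get_text(strip=True) as the text, and dropping the first entry for the dummy heading.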

Let me know if that helps :-)