How to check whether an <a href=""> element exists in a <div> element?

Date: 2018-12-23 15:30:09

Tags: python beautifulsoup

The HTML looks like this:

<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a></div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>

I want to extract the text of every div and, where a div contains a link, also extract the link's display text and URL and show them in the output.

The expected output looks like this:

Text of AAA
Display text of URL A
......AAA/url
Text of BBB
Display text of URL B
......BBB/url
Text of CCC
Text of DDD

I tried nesting a find_all('a') loop inside the find_all('div') loop, but that messed up my output.

5 answers:

Answer 0 (score: 1)

I don't know what your code looks like, but the basic idea is this:

# Assumes `soup` is an already-parsed BeautifulSoup object
for div in soup.find_all('div'):
    for a in div.find_all('a'):
        print(a['href'])  # the link URL
        print(a.text)     # the link's display text

This gives you the URL and the display text of every link inside each div.
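The snippet above prints only the links, not each div's own text. Below is a minimal self-contained sketch that produces the lines the question asks for by walking each div's direct children. The function name `div_lines` and the constant `HTML` are made up for illustration, and the sample markup is copied from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Sample markup copied from the question
HTML = """\
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a></div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
"""


def div_lines(html):
    """Hypothetical helper: collect div text, link text, and link URL in order."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for div in soup.find_all("div"):
        for child in div.children:
            if isinstance(child, Tag) and child.name == "a":
                # A link: keep its display text, then its URL if it has one
                lines.append(child.get_text())
                href = child.get("href")
                if href is not None:
                    lines.append(href)
            elif isinstance(child, NavigableString):
                # Plain text that belongs directly to the div
                text = child.strip()
                if text:
                    lines.append(text)
    return lines


for line in div_lines(HTML):
    print(line)
```

Walking `div.children` keeps the div's own text separate from the link text, which `div.text` would otherwise merge into one string.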

Answer 1 (score: 1)

You can iterate over the divs and print the elements of each div's contents:

s = """
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a> . 
</div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a> . 
</div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
"""
from bs4 import BeautifulSoup as soup
for _text, *_next in map(lambda x:x.contents, soup(s, 'html.parser').find_all('div')):
  print(_text)
  if _next:
    print(_next[0].text)
    print(_next[0]['href'])

Output:

Text of AAA
Display text of URL A
......AAA/url
Text of BBB
Display text of URL B
......BBB/url
Text of CCC
Text of DDD

Answer 2 (score: 1)

from bs4 import BeautifulSoup
html="""
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a></div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
"""
soup = BeautifulSoup(html, "lxml")
for div in soup.findAll('div'):
    print(div.text)
    try:
        print(div.find('a').text)
        print(div.find('a')["href"])
    except AttributeError:
        pass

Output:

Text of AAADisplay text of URL A
Display text of URL A
......AAA/url
Text of BBBDisplay text of URL B
Display text of URL B
......BBB/url
Text of CCC
Text of DDD

Answer 3 (score: 0)

This is easier to read, and it also produces the expected output:

divs = soup.find_all('div')
for div in divs:
  print(div.contents[0]) # Text of AAA
  link = div.find('a')
  if link:
    print(link.text) # Display text of URL A
    print(link['href']) # ......AAA/url
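One caveat with the check above: `div.find('a')` also matches an anchor that has no `href`, in which case `link['href']` raises a `KeyError`. A possible variation, sketched here under the assumption that only anchors carrying an `href` count as "a URL exists", is to pass `href=True` to `find`. The helper name `first_link` and the three-div sample string are hypothetical, not taken from the question.

```python
from bs4 import BeautifulSoup


def first_link(div):
    """Hypothetical helper: (text, href) of the first <a> with an href, else None."""
    link = div.find("a", href=True)  # href=True matches only tags that have the attribute
    if link is None:
        return None
    return link.get_text(), link["href"]


# Made-up sample: a real link, an anchor without href, and no anchor at all
sample = (
    '<div>A<a href="x/url">go</a></div>'
    '<div>B<a name="n">anchor</a></div>'
    '<div>C</div>'
)
soup = BeautifulSoup(sample, "html.parser")
results = [first_link(d) for d in soup.find_all("div")]
print(results)
```

Comparing against `None` makes the "no link found" case explicit instead of relying on the truthiness of the returned tag.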

Answer 4 (score: 0)

Thanks, I worked out a solution:

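# ans_kin holds the div elements; answer_kin is the list collecting the results
for h in ans_kin:
    link = h.find('a')
    if link:
        # Append the URL to the div's text when a link is present
        links = h.text + link.get('href')
    else:
        links = h.text

    answer_kin.append(links)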
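For reference, here is a self-contained sketch of the same approach run against the sample HTML from the question. It assumes `ans_kin` is the result of `find_all('div')` and that `answer_kin` starts as an empty list; neither definition appears in the original post.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question
html = """\
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a></div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
"""

ans_kin = BeautifulSoup(html, "html.parser").find_all("div")  # assumption
answer_kin = []                                               # assumption

for h in ans_kin:
    link = h.find("a")
    # Fall back to an empty string when there is no link or no href
    href = link.get("href", "") if link is not None else ""
    answer_kin.append(h.get_text() + href)

print(answer_kin)
```

Note that the text and the URL are joined with no separator, so each linked entry comes out as a single run-together string; add a delimiter in the concatenation if the parts need to be told apart later.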