Using BeautifulSoup

Asked: 2017-05-15 14:25:50

Tags: python regex bs4

<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>

I am using bs4 and I cannot get the src with a.attrs['src'], although I can get the href. What should I do?

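4 answers:

Answer 0 (score: 16)

You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags themselves, but the same approach also works on a URL together with urllib2.

For a URL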

from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    #print image source
    print image['src']
    #print alternate text
    print image['alt']

For text containing img tags

from BeautifulSoup import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    print image['src']

Answer 1 (score: 6)

The link has no src attribute; you have to target the actual img tag.

import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Returns the src attribute of the img tag nested inside the first 'a' tag
soup.a.img['src']
# -> 'some'

# If there is more than one 'a' tag, check each one for a nested img
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])
# prints: some
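
To see why the lookup on the a tag fails, it can help to inspect each tag's attrs dictionary directly. The sketch below reuses the HTML from the question and assumes bs4 is installed with the built-in html.parser; the names a_attrs and img_attrs are only illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Each tag only carries its own attributes, not those of its children.
a_attrs = soup.a.attrs        # attributes of the <a> tag
img_attrs = soup.a.img.attrs  # attributes of the nested <img> tag

print(a_attrs)
print(img_attrs)
print('src' in a_attrs)
```

Because src is stored on the img tag, it has to be read from soup.a.img rather than from soup.a.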

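Answer 2 (score: 1)

You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags themselves, but the same approach also works on a URL together with urllib3.

The solution in the top-voted answer does not work with Python 3. Here is an implementation that does:

For a URL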

from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'

response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.findAll('img')

for image in images:
    #print image source
    print(image['src'])
    #print alternate text
    print(image['alt'])

For text containing img tags

from bs4 import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText, "html.parser")
images = soup.findAll('img')
for image in images:
    print(image['src'])

Answer 3 (score: 1)

Here is a solution that does not raise a KeyError when an img tag has no src attribute:

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')

images = bs.find_all('img')
for img in images:
    if img.has_attr('src'):
        print(img['src'])
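
Another way to avoid the KeyError is Tag.get, which returns None (or a supplied default) when the attribute is missing. The following is a minimal sketch on an inline HTML string rather than a live site; the sample markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second img tag has no src attribute.
sample_html = '<img src="https://src1.com/" alt="first"/><img alt="no source"/>'
bs = BeautifulSoup(sample_html, 'html.parser')

# Tag.get returns None instead of raising KeyError for a missing attribute.
sources = [img.get('src') for img in bs.find_all('img')]

# Keep only the tags that actually have a src value.
present = [src for src in sources if src is not None]
print(present)
```

Compared with has_attr, this keeps the lookup and the missing-attribute check in a single call, which may be convenient when collecting values into a list.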