How to get only the MP3 links using BeautifulSoup and Python

Date: 2014-08-29 08:51:36

Tags: python beautifulsoup

Here is my code:

from bs4 import BeautifulSoup
import urllib.request
import re

# Download the page and parse it with Python's built-in HTML parser
url = urllib.request.urlopen("http://www.djmaza.info/Abhi-Toh-Party-Khubsoorat-Full-Song-MP3-2014-Singles.html")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
# Print every link whose href contains "http"
for a in soup.find_all('a', href=True):
    if re.findall('http', a['href']):
        print("URL:", a['href'])

Output of this code:

URL: http://twitter.com/mp3khan
URL: http://www.facebook.com/pages/MP3KhanCom-Music-Updates/233163530138863
URL: https://plus.google.com/114136514767143493258/posts
URL: http://www.djhungama.com
URL: http://www.djhungama.com
URL: http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -190Kbps [DJMaza.Info].mp3
URL: http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -190Kbps [DJMaza.Info].mp3
URL: http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -320Kbps [DJMaza.Info].mp3
URL: http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -320Kbps [DJMaza.Info].mp3
URL: http://www.htmlcommentbox.com
URL: http://www.djmaza.com
URL: http://www.djhungama.com

I only need the .mp3 links.

So how should I rewrite the code?

Thanks

3 answers:

Answer 0 (score: 3)

Change your findAll call so that it matches the href against a regular expression, for example:

# Raw string so that \. reaches the regex engine as an escaped dot
for a in soup.findAll('a', href=re.compile(r'http.*\.mp3')):
    print("URL:", a['href'])

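Update in response to a comment:

"I need to store these links in an array so I can download them. How can I do that?"

You can use a list comprehension to build the list: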

links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
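
To make the regex approach easy to try without touching the network, here is a minimal sketch that runs against an inline HTML snippet. The sample markup, the example.com URLs, and the helper name mp3_links are assumptions made for illustration, not part of the original page. The pattern is also anchored (^ and $) and case-insensitive, which is slightly stricter than the pattern above, and duplicates are dropped because the page lists each file twice.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<a href="http://twitter.com/mp3khan">Twitter</a>
<a href="http://example.com/music/song-190Kbps.mp3">190 Kbps</a>
<a href="http://example.com/music/song-190Kbps.mp3">190 Kbps (again)</a>
<a href="http://example.com/music/song-320Kbps.mp3">320 Kbps</a>
<a href="/relative/track.mp3">relative link</a>
"""

# Must start with http:// or https:// and end with .mp3 (any letter case).
MP3_URL = re.compile(r'^https?://.*\.mp3$', re.IGNORECASE)


def mp3_links(html):
    """Return the unique absolute .mp3 hrefs in html, in document order."""
    soup = BeautifulSoup(html, 'html.parser')
    links = [a['href'] for a in soup.find_all('a', href=MP3_URL)]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(links))


print(mp3_links(SAMPLE_HTML))
```

With the real page, you would pass the bytes returned by urlopen(...).read() to mp3_links instead of SAMPLE_HTML.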

Answer 1 (score: 2)

You can use .endswith(). For example:

if re.findall('http',a['href']) and a['href'].endswith(".mp3"):
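
The same check can be written without the re module at all. The sketch below is one possible variant, not the answer's exact code: it swaps re.findall('http', ...) for str.startswith, which only accepts links that begin with http:// or https://, and it lowercases the href so that ".MP3" also matches. The function name is_http_mp3 and the sample hrefs are made up for illustration.

```python
def is_http_mp3(href):
    """Return True for absolute http(s) links whose text ends in .mp3."""
    href = href.strip().lower()
    # startswith accepts a tuple of prefixes to test against.
    return href.startswith(('http://', 'https://')) and href.endswith('.mp3')


# Hypothetical hrefs, similar in shape to those printed in the question.
hrefs = [
    'http://twitter.com/mp3khan',
    'https://example.com/music/track-320Kbps.MP3',
    '/downloads/track.mp3',
    'http://example.com/mp3/index.html',
]

mp3_only = [h for h in hrefs if is_http_mp3(h)]
print(mp3_only)
```

In the original loop, the condition would then read: if is_http_mp3(a['href']):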

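Answer 2 (score: 1)

If the file extension is all you care about, keep in mind that endswith() only returns a boolean; it does not give you the file's extension. It is cleaner to write a small helper function for this purpose and call it like this: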

if re.findall('http', a['href']) and isMP3file(a['href']):

You can then define the function like this:

import os
def isMP3file(link):
    name, ext = os.path.splitext(link)
    return ext.lower() == '.mp3'
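
One caveat with calling os.path.splitext on a whole URL: a query string or fragment ends up inside the "extension", so a link such as song.mp3?dl=1 would be rejected. A possible refinement, sketched below under the assumption that such links may occur, is to split off the URL path first with urllib.parse.urlparse. The name is_mp3_url and the example.com URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def is_mp3_url(link):
    """Return True if the path component of link has a .mp3 extension."""
    # urlparse separates the path from any ?query or #fragment.
    path = urlparse(link).path
    # URL paths always use forward slashes, so posixpath is used here.
    _, ext = posixpath.splitext(path)
    return ext.lower() == '.mp3'


for link in (
    'http://example.com/music/song.MP3?dl=1#t=10',
    'http://example.com/music/song.mp3',
    'http://example.com/page.html',
    'http://example.com/mp3/',
):
    print(link, is_mp3_url(link))
```

This keeps the spirit of isMP3file (compare the lowercased extension) while ignoring anything after the path.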