I'm trying to list all the pages of a website in Python so that I can scrape them with BeautifulSoup. What I have right now is:
team_urls = ['http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html',
'http://www.lyricsfreak.com/e/ed+sheeran/a+team_20983411.html',
'http://www.lyricsfreak.com/e/ed+sheeran/i+see+fire_21071421.html',
'http://www.lyricsfreak.com/e/ed+sheeran/perfect_21113253.html',
'http://www.lyricsfreak.com/e/ed+sheeran/castle+on+the+hill_21112527.html',
'http://www.lyricsfreak.com/e/ed+sheeran/supermarket+flowers_21113249.html',
'http://www.lyricsfreak.com/e/ed+sheeran/lego+house_20983415.html',
'http://www.lyricsfreak.com/e/ed+sheeran/even+my+dad+does+sometimes_21085123.html',
'http://www.lyricsfreak.com/e/ed+sheeran/kiss+me_20983414.html',
'http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html',
'http://www.lyricsfreak.com/e/ed+sheeran/i+see+fire_21071421.html'
]
I'd like to call a function that collects every page starting with http://www.lyricsfreak.com/e/ed+sheeran/, since I know the current list is messy and there are roughly 30 more pages available, rather than adding them all by hand.
Answer 0: (score: 0)
In Python 2.x, you can build the list of sub-pages as follows:
from bs4 import BeautifulSoup
import urllib2
base_url = 'http://www.lyricsfreak.com'
request = urllib2.Request(base_url + '/e/ed+sheeran/', headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'})
response = urllib2.urlopen(request)
soup = BeautifulSoup(response.read(), 'html.parser')
urls = []
for tr in soup.select('tbody tr'):
    urls.append(base_url + tr.td.a['href'])
print urls
This creates a urls list that starts:
['http://www.lyricsfreak.com/e/ed+sheeran/a+team_20983411.html', 'http://www.lyricsfreak.com/e/ed+sheeran/afire+love_21084845.html', ...
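The scraped list can also contain duplicates (the hand-built list above has the i+see+fire link twice, for example), so it may be worth filtering by the /e/ed+sheeran/ prefix and de-duplicating while preserving order. A minimal sketch (the clean_urls helper is just an illustration, not part of the answer's code):

```python
def clean_urls(urls, prefix):
    """Return the urls that start with prefix, de-duplicated, order preserved."""
    seen = set()
    result = []
    for url in urls:
        if url.startswith(prefix) and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Calling clean_urls(team_urls, 'http://www.lyricsfreak.com/e/ed+sheeran/') on the original hand-built list would drop the second copy of the i+see+fire link while keeping everything else in order.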
In Python 3.x, it can be modified as follows:
from bs4 import BeautifulSoup
import urllib.request
base_url = 'http://www.lyricsfreak.com'
resp = urllib.request.urlopen(base_url + '/e/ed+sheeran/')
soup = BeautifulSoup(resp, 'html.parser')
urls = []
for tr in soup.select('tbody tr'):
    urls.append(base_url + tr.td.a['href'])
print(urls)
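Note that this Python 3 version drops the User-Agent header that the Python 2 version sent. If the site rejects the default urllib agent, the same header can be set through urllib.request.Request; a sketch, reusing the header string from the Python 2 snippet:

```python
import urllib.request

# Build a request carrying a browser-like User-Agent header,
# mirroring the urllib2.Request call from the Python 2 version.
base_url = 'http://www.lyricsfreak.com'
request = urllib.request.Request(
    base_url + '/e/ed+sheeran/',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'})
# response = urllib.request.urlopen(request)  # then parse with BeautifulSoup
```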
Or use the requests library, as follows:
from bs4 import BeautifulSoup
import requests
base_url = 'http://www.lyricsfreak.com'
response = requests.get(base_url + '/e/ed+sheeran/')
soup = BeautifulSoup(response.text, 'html.parser')
urls = []
for tr in soup.select('tbody tr'):
    urls.append(base_url + tr.td.a['href'])
Install it with:
pip install requests
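One caveat on all three snippets: they build absolute URLs by plain string concatenation (base_url + href), which only works if the hrefs on the page are root-relative. urllib.parse.urljoin handles both relative and already-absolute hrefs safely, so it is a more robust choice:

```python
from urllib.parse import urljoin

base_url = 'http://www.lyricsfreak.com'

# A root-relative href is joined onto the site root.
print(urljoin(base_url, '/e/ed+sheeran/a+team_20983411.html'))
# An already-absolute href is returned unchanged instead of being mangled.
print(urljoin(base_url, 'http://www.lyricsfreak.com/e/ed+sheeran/kiss+me_20983414.html'))
```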