Improve python snippet

Date: 2014-03-13 17:08:23

标签: python html web-scraping html-parsing beautifulsoup

I'm doing some web scraping with a Python script. I want to find the base URL of a given section on a web page, which looks like this:

<div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
    ...
</div>

So I just need everything from the first href except the number ('webpage-category/page/'), and I have the following working code:

pages = [l['href'] for link in soup.find_all('div', class_='pagination')
     for l in link.find_all('a') if not re.search('pageSub', l['href'])]

s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])

The problem is that building the whole list is wasteful, since I only need the first href. I think a generator would be the answer, but I couldn't work it out. Maybe you can help me make this code more concise?

1 Answer:

Answer 0 (score: 2)

How about this:

from bs4 import BeautifulSoup

html = """ <div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
</div>"""

soup = BeautifulSoup(html)

link = soup.find('div', {'class': 'pagination'}).find('a')['href']

print '/'.join(link.split('/')[:-1])

Prints:

webpage-category/page
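As a side note (my own addition, not part of the original answer), the `'/'.join(link.split('/')[:-1])` idiom can also be written with `str.rsplit` and a `maxsplit` of 1, which splits only once from the right:

```python
link = 'webpage-category/page/1'

# Split once from the right and keep the left part (everything before the last '/')
base = link.rsplit('/', 1)[0]
print(base)  # webpage-category/page
```

Both forms produce the same result here; `rsplit` just avoids rebuilding the string from a list of pieces.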

FYI, regarding the code you provided — you can use `next()` instead of a list comprehension:

s = next(l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href']))
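One caveat (a sketch of my own, not from the original answer): `next()` on an exhausted generator raises `StopIteration` if no link matches. Passing a default as the second argument avoids that. A minimal pure-Python illustration, with `hrefs` standing in for the extracted link targets:

```python
# Hypothetical list of href values, as the comprehension above would see them
hrefs = ['webpage-category/page/1', 'pageSub/extra', 'webpage-category/page/2']

# next() pulls only the first matching item; None is returned if nothing matches
first = next((h for h in hrefs if 'pageSub' not in h), None)
print(first)  # webpage-category/page/1

# With no matches, the default is returned instead of raising StopIteration
missing = next((h for h in hrefs if 'other' in h), None)
print(missing)  # None
```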

UPD (using the website link you provided):

import urllib2
from bs4 import BeautifulSoup


url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urllib2.urlopen(url))

links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')

print next('/'.join(link['href'].split('/')[:-1]) for link in links 
           if link.text.isdigit() and link.text != "1")
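For completeness (my own adaptation, not part of the original answer): `urllib2` and the bare `print` statement are Python 2 only. If BeautifulSoup isn't available, the same first-href extraction can be sketched in Python 3 with the standard-library `html.parser`:

```python
from html.parser import HTMLParser

class FirstPaginationHref(HTMLParser):
    """Record the first <a href> seen inside <div class='pagination'>."""
    def __init__(self):
        super().__init__()
        self.in_pagination = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('class') == 'pagination':
            self.in_pagination = True
        elif tag == 'a' and self.in_pagination and self.href is None:
            self.href = attrs.get('href')

html = """<div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
</div>"""

parser = FirstPaginationHref()
parser.feed(html)
print(parser.href.rsplit('/', 1)[0])  # webpage-category/page
```

This is more verbose than the BeautifulSoup one-liner, but it needs no third-party packages; for real scraping, BeautifulSoup (with `urllib.request` in place of `urllib2`) remains the simpler choice.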