Improve python snippet

Date: 2014-03-13 17:08:23

标签: python html web-scraping html-parsing beautifulsoup

I'm doing some web scraping with a Python script. I want to find the base URL of a given section on a web page, which looks like this:

<div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
    ...
</div>

So I just need everything from the first href except the number ('webpage-category/page/'), and I have the following working code:

pages = [l['href'] for link in soup.find_all('div', class_='pagination')
     for l in link.find_all('a') if not re.search('pageSub', l['href'])]

s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])

The problem is that building the whole list is wasteful, since I only need the first href. I think a generator would be the answer, but I couldn't work it out. Maybe you can help me make this code more concise?

1 Answer:

Answer 0 (score: 2)

How about this:

from bs4 import BeautifulSoup

html = """ <div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
</div>"""

soup = BeautifulSoup(html)

link = soup.find('div', {'class': 'pagination'}).find('a')['href']

print '/'.join(link.split('/')[:-1])

Prints:

webpage-category/page
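As a side note (my own addition, not part of the original answer), the `'/'.join(link.split('/')[:-1])` idiom can also be written with `str.rsplit` and a `maxsplit` of 1, which splits only once from the right:

```python
link = 'webpage-category/page/1'

# Split once from the right and keep the left part (everything before the last '/')
base = link.rsplit('/', 1)[0]
print(base)  # webpage-category/page
```

Both forms produce the same result here; `rsplit` just avoids rebuilding the string from a list of pieces.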

FYI, regarding the code you provided — you can use `next()` instead of a list comprehension:

s = next(l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href']))
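One caveat (a sketch of my own, not from the original answer): `next()` on an exhausted generator raises `StopIteration` if no link matches. Passing a default as the second argument avoids that. A minimal pure-Python illustration, with `hrefs` standing in for the extracted link targets:

```python
# Hypothetical list of href values, as the comprehension above would see them
hrefs = ['webpage-category/page/1', 'pageSub/extra', 'webpage-category/page/2']

# next() pulls only the first matching item; None is returned if nothing matches
first = next((h for h in hrefs if 'pageSub' not in h), None)
print(first)  # webpage-category/page/1

# With no matches, the default is returned instead of raising StopIteration
missing = next((h for h in hrefs if 'other' in h), None)
print(missing)  # None
```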

UPD (using the website link you provided):

import urllib2
from bs4 import BeautifulSoup


url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urllib2.urlopen(url))

links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')

print next('/'.join(link['href'].split('/')[:-1]) for link in links 
           if link.text.isdigit() and link.text != "1")
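For completeness (my own adaptation, not part of the original answer): `urllib2` and the bare `print` statement are Python 2 only. If BeautifulSoup isn't available, the same first-href extraction can be sketched in Python 3 with the standard-library `html.parser`:

```python
from html.parser import HTMLParser

class FirstPaginationHref(HTMLParser):
    """Record the first <a href> seen inside <div class='pagination'>."""
    def __init__(self):
        super().__init__()
        self.in_pagination = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('class') == 'pagination':
            self.in_pagination = True
        elif tag == 'a' and self.in_pagination and self.href is None:
            self.href = attrs.get('href')

html = """<div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
</div>"""

parser = FirstPaginationHref()
parser.feed(html)
print(parser.href.rsplit('/', 1)[0])  # webpage-category/page
```

This is more verbose than the BeautifulSoup one-liner, but it needs no third-party packages; for real scraping, BeautifulSoup (with `urllib.request` in place of `urllib2`) remains the simpler choice.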