How to scrape the next pages (links)

Date: 2016-04-27 09:11:34

Tags: python web-scraping beautifulsoup urllib2

We currently access the first page of the website www.theft-alerts.com with the following code:

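import urllib2
from bs4 import BeautifulSoup

connection = urllib2.urlopen('http://www.theft-alerts.com')
# Turn <br> tags into newlines before parsing so the text keeps its line breaks
soup = BeautifulSoup(connection.read().replace("<br>","\n"), "html.parser")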

theftalerts = []
for sp in soup.select("table div.itemspacingmodified"):
    for wd in sp.select("div.itemindentmodified"):
        text = wd.text
        if not text.startswith("Images :"):
            print(text)

Output for the first page:

STOLEN : A LARGE TAYLORS OF LOUGHBOROUGH BELL
Stolen from Bromyard on 7 August 2014
Item : The bell has a diameter of 37 1/2" is approx 3' tall weighs just shy of half a ton and was made by Taylor's of Loughborough in 1902. It is stamped with the numbers 232 and 11.

The bell had come from Co-operative Wholesale Society's Crumpsall Biscuit Works in Manchester.
Any info to : PC 2361. Tel 0300 333 3000
Messages : Send a message
Crime Ref : 22EJ / 50213D-14

No of items stolen : 1

Location : UK > Hereford & Worcs
Category : Shop, Pub, Church, Telephone Boxes & Bygones
ID : 84377
User : 1 ; Antique/Reclamation/Salvage Trade ;  (Administrator)
Date Created : 11 Aug 2014 15:27:57
Date Modified : 11 Aug 2014 15:37:21;

The website has more pages (1 to 19), but we only see page 1. How can we get the remaining pages?

We tried this:

connection = urllib2.urlopen('http://www.theft-alerts.com', 'http://www.theft-alerts.com/index-2.html', 'http://www.theft-alerts.com/index-3.html', 'http://www.theft-alerts.com/index-4.html','http://www.theft-alerts.com/index-5.html', 'http://www.theft-alerts.com/index-6.html', 'http://www.theft-alerts.com/index-7.html')

But this did not work. Output:

"You can't pass both context and any of cafile, capath, and "
ValueError: You can't pass both context and any of cafile, capath, and cadefault

2 answers:

Answer 0 (score: 0)

You can get the links to the next pages by selecting the element with the class below and iterating over the tags inside it:

resultnav

Answer 1 (score: 0)

Why not loop over the index numbers?

for i in range(1, 20):
    connection = urllib2.urlopen("http://www.theft-alerts.com/index-%i.html" % i)
    # process the file here

For a more general solution that keeps going until the next page is no longer a valid link:

i = 1
while True:
    try:
        conn = urllib2.urlopen("http://www.theft-alerts.com/index-%i.html" % i)
    except urllib2.HTTPError:  # urlopen raises for error statuses such as 404
        break
    if conn.getcode() != 200:  # perhaps retry a couple of times
        break
    # process the file here
    i += 1

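The problem with your code is that you are trying to pass multiple links to urllib2.urlopen, which is not how it works. You need to pass each link on its own and then process the response.

Here is the signature of urlopen, which explains the error you are seeing: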
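As a minimal sketch of "one link per call": the helper names `page_urls` and `fetch_pages` and the `opener` parameter are hypothetical and not from the question. The URL pattern (site root for page 1, index-N.html from page 2 on) is an assumption taken from the URLs listed in the question's attempt. The Python 3 import fallback is an addition so the sketch runs on either version.

```python
try:
    from urllib2 import urlopen  # Python 2, as used in the question
except ImportError:
    from urllib.request import urlopen  # Python 3 equivalent (added for portability)

BASE_URL = "http://www.theft-alerts.com"


def page_urls(last_page):
    """Return the URLs for pages 1..last_page.

    Assumption: page 1 is the site root and page N (N >= 2) is index-N.html,
    following the URLs shown in the question.
    """
    urls = [BASE_URL]
    for n in range(2, last_page + 1):
        urls.append("%s/index-%d.html" % (BASE_URL, n))
    return urls


def fetch_pages(last_page, opener=urlopen):
    """Yield (url, body) pairs, opening exactly one URL per call to opener."""
    for url in page_urls(last_page):
        response = opener(url)
        try:
            yield url, response.read()
        finally:
            response.close()
```

Each body yielded by `fetch_pages(19)` could then be passed to BeautifulSoup in the same way as the first page in the question.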

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            cafile=None, capath=None, cadefault=False, context=None)
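To make the connection to the error concrete, here is a small illustration of how the seven URLs from the question would land on that parameter list when passed positionally. It does not call urllib2 at all. `bind_positionally` is a hypothetical helper, and the final check only mimics the kind of validation the error message suggests; it is not copied from the library source.

```python
# Parameter names, in order, taken from the urlopen signature quoted above.
URLOPEN_PARAMS = ["url", "data", "timeout", "cafile", "capath", "cadefault", "context"]


def bind_positionally(args):
    """Map positional arguments onto the parameter names (illustration only)."""
    return dict(zip(URLOPEN_PARAMS, args))


# The seven URLs from the question's attempt.
urls = ["http://www.theft-alerts.com"] + [
    "http://www.theft-alerts.com/index-%d.html" % n for n in range(2, 8)
]
bound = bind_positionally(urls)

# Assumed shape of the check behind the ValueError: context is set together
# with at least one of cafile, capath, cadefault.
conflict = bound.get("context") is not None and any(
    bound.get(name) for name in ("cafile", "capath", "cadefault")
)

for name in URLOPEN_PARAMS:
    print("%-9s <- %s" % (name, bound[name]))
print("context combined with cafile/capath/cadefault: %s" % conflict)
```

Only the first URL is treated as `url`. The rest are taken as `data`, `timeout`, the certificate options and `context`, which is why the message complains about `context` and `cafile` rather than about too many links.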