Scraping paginated pages with Python Beautiful Soup

Date: 2017-07-10 13:24:53

Tags: python beautifulsoup bs4

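My scraper works fine: it pulls the correct data from all 9 pages on the site. The one problem is that I don't think my current approach is ideal. If the site has more pages than the range I typed in, the results on those extra pages will be missed.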

My code is below:

import requests
import time
import csv
import sys
from bs4 import BeautifulSoup

houses = []

url = "https://www.propertypal.com/property-to-rent/newtownabbey/"
page=requests.get(url)
soup=BeautifulSoup(page.text,"lxml")
g_data = soup.findAll("div", {"class": "propbox-details"})
for item in g_data:
    try:
        title = item.find_all("span", {"class": "propbox-addr"})[0].text
    except:
        pass
    try:
        town = item.find_all("span", {"class": "propbox-town"})[0].text
    except:
        pass
    try:
        price = item.find_all("span", {"class": "price-value"})[0].text
    except:
        pass
    try:
        period = item.find_all("span", {"class": "price-period"})[0].text
    except:
        pass
    course=[title,town,price,period]
    houses.append(course)


for i in range(1,15):
    time.sleep(2)#delay time requests are sent so we don't get kicked by server
    url2 = "https://www.propertypal.com/property-to-rent/newtownabbey/page-{0}".format(i)
    page2=requests.get(url2)
    print(url2)
    soup=BeautifulSoup(page2.text,"lxml")
    g_data = soup.findAll("div", {"class": "propbox-details"})
    for item in g_data:
        try:
            title = item.find_all("span", {"class": "propbox-addr"})[0].text
        except:
            pass
        try:
            town = item.find_all("span", {"class": "propbox-town"})[0].text
        except:
            pass
        try:
            price = item.find_all("span", {"class": "price-value"})[0].text
        except:
            pass
        try:
            period = item.find_all("span", {"class": "price-period"})[0].text
        except:
            pass

        course=[title,town,price,period]
        houses.append(course)


with open ('newtownabbeyrentalproperties.csv','w') as file:
   writer=csv.writer(file)
   writer.writerow(['Address','Town', 'Price', 'Period'])
   for row in houses:
      writer.writerow(row)

As you can see from the code I'm using,

for i in range(1,15):
    time.sleep(2)#delay time requests are sent so we don't get kicked by server
    url2 = "https://www.propertypal.com/property-to-rent/newtownabbey/page-{0}".format(i)   

this inserts the numbers 1 to 14 into the page number in the URL (the page-{0} part).

This is not ideal: if the site gains extra pages, e.g. pages 15, 16 and 17, the scraper will miss the data on them, because it only looks as far as page 14.

Can anyone suggest how to use the site's pagination to find the number of pages to scrape, or a better way to set up this for loop?

Many thanks.

3 Answers:

Answer 0 (score: 1)

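See my modification below. This solution should keep looping through pages until it tries to fetch one that doesn't exist. It also helps because your code always tries 15 pages, even if there are only one, two, or three.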

page_num = 0
http_status_okay = True
while http_status_okay:
    page_num = page_num + 1
    time.sleep(2)  # delay between requests so we don't get kicked by the server
    url2 = "https://www.propertypal.com/property-to-rent/newtownabbey/page-{0}".format(page_num)
    page2 = requests.get(url2)

    # continue only while we get a 200 response code
    # (compare with ==; "is" tests object identity, not numeric equality)
    if page2.status_code == 200:
        http_status_okay = True
        # ... parse page2.text here, as in the original for loop ...
    else:
        http_status_okay = False
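
To make the stop condition easier to reuse, the same idea can be wrapped in a generator. This is only a sketch: `iter_pages`, `fetch_with_requests` and the `max_pages` cap are names made up for illustration, and it has not been confirmed here that the site actually returns a non-200 status for a missing page, which is why the cap exists as a safety net.

```python
import time

URL_TEMPLATE = "https://www.propertypal.com/property-to-rent/newtownabbey/page-{0}"


def fetch_with_requests(url):
    """Return (status_code, html) for a URL using the requests library."""
    import requests  # imported lazily so iter_pages can be tested offline

    response = requests.get(url, timeout=30)
    return response.status_code, response.text


def iter_pages(fetch=fetch_with_requests, url_template=URL_TEMPLATE,
               delay=2, max_pages=500):
    """Yield (page_num, html) for page 1, 2, ... until a non-200 response.

    max_pages is a hypothetical safety cap in case the server answers
    200 for pages that do not really exist.
    """
    for page_num in range(1, max_pages + 1):
        if page_num > 1:
            time.sleep(delay)  # be polite between requests
        status_code, html = fetch(url_template.format(page_num))
        if status_code != 200:
            break
        yield page_num, html
```

Each yielded `html` string can then be handed to `BeautifulSoup(html, "lxml")` and parsed exactly as in the question's inner loop.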

Answer 1 (score: 1)

Something like this (I haven't tested it, so it may or may not work; it's just meant to show the principle)

button_next = soup.find("a", {"class": "btn paging-next"}, href=True)
while button_next:
    time.sleep(2)#delay time requests are sent so we don't get kicked by server
    url2 = "https://www.propertypal.com{0}".format(button_next["href"])
    page2=requests.get(url2)
    print(url2)
    soup=BeautifulSoup(page2.text,"lxml")
    g_data = soup.findAll("div", {"class": "propbox-details"})
    for item in g_data:
        try:
            title = item.find_all("span", {"class": "propbox-addr"})[0].text
        except:
            pass
        try:
            town = item.find_all("span", {"class": "propbox-town"})[0].text
        except:
            pass
        try:
            price = item.find_all("span", {"class": "price-value"})[0].text
        except:
            pass
        try:
            period = item.find_all("span", {"class": "price-period"})[0].text
        except:
            pass

        # inside the item loop so every listing is saved, not just the last one
        course=[title,town,price,period]
        houses.append(course)

    button_next = soup.find("a", {"class": "btn paging-next"}, href=True)
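
A slightly fuller sketch of the same "follow the next button" principle, written so it can be tried without hitting the site. The class names come from the code above; `parse_page`, `scrape` and the `fetch_html` callable are hypothetical names for illustration, and the assumption that the next link carries a `paging-next` class is taken from the snippet above rather than checked against the live page. It also stores an empty string for a missing field instead of reusing the previous listing's value.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup

START_URL = "https://www.propertypal.com/property-to-rent/newtownabbey/"
FIELDS = ("propbox-addr", "propbox-town", "price-value", "price-period")


def parse_page(html, page_url):
    """Return (rows, next_url); next_url is None when there is no next link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="propbox-details"):
        row = []
        for css_class in FIELDS:
            span = item.find("span", class_=css_class)
            # Empty string when the field is absent from this listing.
            row.append(span.get_text(strip=True) if span else "")
        rows.append(row)
    # class_ matches any <a> whose class list contains "paging-next".
    link = soup.find("a", class_="paging-next", href=True)
    next_url = urljoin(page_url, link["href"]) if link else None
    return rows, next_url


def scrape(fetch_html, start_url=START_URL, delay=2):
    """Follow next links from start_url, collecting rows from every page."""
    houses, url, seen = [], start_url, set()
    while url and url not in seen:  # 'seen' guards against link loops
        seen.add(url)
        rows, next_url = parse_page(fetch_html(url), url)
        houses.extend(rows)
        if next_url:
            time.sleep(delay)  # pause before requesting the next page
        url = next_url
    return houses
```

With requests installed, a call such as `scrape(lambda url: requests.get(url).text)` would start from the first page, so the separate first-page block in the question is no longer needed.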

Answer 2 (score: 0)

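Make a request to a page that doesn't exist, for example: https://www.propertypal.com/property-to-rent/newtownabbey/page-999999. Find the difference between a page that exists and one that doesn't. Then parse page after page until you hit that difference.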
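
One possible reading of this, as a rough sketch: assume the difference is that a non-existent page contains no `propbox-details` blocks at all. That is a guess, not something verified against the site; the real difference might be a redirect, an error message, or a different status code, in which case `has_listings` would need to test for that instead. `has_listings`, `collect_pages`, `fetch_html` and `max_pages` are hypothetical names.

```python
import time

from bs4 import BeautifulSoup

URL_TEMPLATE = "https://www.propertypal.com/property-to-rent/newtownabbey/page-{0}"


def has_listings(html):
    """True if the page contains at least one listing block (assumed marker)."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", class_="propbox-details") is not None


def collect_pages(fetch_html, url_template=URL_TEMPLATE, delay=2, max_pages=500):
    """Return the HTML of pages 1, 2, ... up to the first page without listings."""
    pages = []
    for page_num in range(1, max_pages + 1):
        if page_num > 1:
            time.sleep(delay)  # pause between requests
        html = fetch_html(url_template.format(page_num))
        if not has_listings(html):
            break  # this page looks like a non-existent one, so stop
        pages.append(html)
    return pages
```

Checking `has_listings` against the page-999999 response first is a quick way to see whether this particular marker is the right one.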