How to get more data

Asked: 2018-10-26 20:16:27

Tags: python python-3.x web-scraping beautifulsoup python-requests

I'm trying to download data on all of the diamonds listed on this site: https://www.bluenile.com/diamond-search?tag=none&track=NavDiaVAll

The plan is to grab the information and try to work out which listing is my favorite (I'll run some regressions to see which stones are good value and pick from those).

To do this I wrote my first scraper. The problem is that it only seems to grab the first 60 diamonds, not everything I can see on the site. Ideally I'd like it to grab all 100k+ diamonds across the different shapes (round, cushion, etc.). How do I get all the data?

(I think this is because some rows only load once you scroll down, but I'd have expected the first load to contain more than 60, and even if I scroll all the way to the bottom, only 1000 rows are ever shown.)

Here is my code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.bluenile.com/diamond-search?tag=none&track=NavDiaVAll'

url_response = requests.get(url)
soup = BeautifulSoup(url_response.content, "html.parser")

""" Now we have the page as soup

Lets start to get the header"""

headerinctags = soup.find_all('div', class_='grid-header normal-header')
header = headerinctags[0].get_text(';')

diamondsmessy = soup.find_all('a', class_='grid-row row ')
diamondscleaned = diamondsmessy[1].get_text(";")


"""Create diamonds dataframe with the header; take out the 1st value"""
header = header.split(";")
del header[0]
diamonds = pd.DataFrame(columns=header)

""" place rows into dataframe after being split; use a & b as dummy variables; take out 5th value"""

for i in range(len(diamondsmessy)):
    a = diamondsmessy[i].get_text(";")
    b = a.split(";")
    del b[4]
    a = pd.DataFrame(b, index=header)
    b = a.transpose()
    diamonds = pd.concat([diamonds, b], ignore_index=True)

print(diamonds)
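
A quick way to confirm the lazy-loading suspicion (a minimal sketch reusing the soup object from the code above): requests only ever receives the initial HTML and never executes the page's JavaScript, so any rows injected on scroll are invisible to it. Counting the rows in the static HTML shows the gap:

rows = soup.find_all('a', class_='grid-row row ')
print(len(rows))  # 60 here, far fewer than the browser eventually shows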

1 Answer:

Answer 0: (score: 0)

I've worked out how to do it. It isn't fast, but essentially I needed Selenium to scroll down the page. I was still stuck at the 1000-row cap, so I loop, raising the minimum-price filter each pass so the page reloads with the next batch of results.

To help others, here's the code:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

#for fun, let's time this
start = time.time()

"""Define important numbers"""

scroll_pause_time = 0.5 #delay after each scroll
scroll_number = 20 #number of times scrolled per page
pages_visited = 25 #number of times the minimum price is increased

"""Set up the website"""

url = 'https://www.bluenile.com/diamond-search?tag=none&track=NavDiaVAll'

url_response = webdriver.Firefox()
url_response.get(url)

#minimum & max carat:
min_carat = url_response.find_element_by_css_selector('.carat-filter .allowHighAscii:nth-child(1)')
min_carat.send_keys('0.8')
min_carat.send_keys(Keys.ENTER)

max_carat = url_response.find_element_by_css_selector('.carat-filter .allowHighAscii:nth-child(2)')
max_carat.send_keys('1.05')
max_carat.send_keys(Keys.ENTER)


#Shapes of diamonds:
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(2) > .shape-filter-button-inner').click()
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(4) > .shape-filter-button-inner').click()
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(5) > .shape-filter-button-inner').click()
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(7) > .shape-filter-button-inner').click()

"""Create diamonds dataframe with the header; take out the 1st value"""
soup = BeautifulSoup(url_response.page_source, "html.parser")

headerinctags = soup.find_all('div', class_='grid-header normal-header')
header = headerinctags[0].get_text(';')

header = header.split(";")
del header[0]
diamonds = pd.DataFrame(columns=header)

"""Start loop, dummy variable j"""
for j in range(pages_visited):

    print(j)
    url_response.execute_script("window.scrollTo(0, 0)")

    #Set the minimum price
    if j != 0:
        min_price = url_response.find_element_by_css_selector('input[name="minValue"]')

        min_price.send_keys(Keys.CONTROL, "a")
        min_price.send_keys(Keys.DELETE)

        a = diamonds["Price"].iloc[-1] #price of the last row scraped so far
        a = a.replace('$','')
        a = a.replace(',','')
        min_price.send_keys(a)
        min_price.send_keys(Keys.ENTER)

    #Scroll down to force the lazy-loaded rows to render
    for i in range(scroll_number):
        url_response.execute_script("window.scrollTo(0, "+str((i+1)*2000)+')')
        time.sleep(scroll_pause_time)

    #Grab data
    soup = BeautifulSoup(url_response.page_source, "html.parser")
    diamondsmessy = soup.find_all('a', class_='grid-row row ')


    """ place rows into dataframe after being split; use a & b as dummy variables; take out 5th value"""

    for i in range(len(diamondsmessy)):
        a = diamondsmessy[i].get_text(";")
        b = a.split(";")
        del b[4]
        a = pd.DataFrame(b, index=header)
        b = a.transpose()
        diamonds = pd.concat([diamonds, b], ignore_index=True)

diamonds = diamonds.drop_duplicates()
diamonds.to_csv('diamondsoutput.csv')

print(diamonds)

end = time.time()
print("This took "+ str(end-start)+" seconds")