How to retrieve all links from a dynamic website using Selenium Python

Time: 2019-03-06 05:35:19

Tags: javascript python json selenium-webdriver web-scraping

I have the following code:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException


chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')

prefs = {'profile.managed_default_content_settings.images':2}
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options)
driver.get("http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1")
wait = WebDriverWait(driver,20)
links = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ess-product-brand + [href]")))
results = [link.get_attribute("href") for link in links]
#print(links)
print(results)
driver.quit()

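However, I only get the results/links for the featured products, not for all products. Occasionally (rarely, if I run it some 20 times) I do get all the products, but I want to get all of them every time. I also tried another approach, shown below: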

from selenium import webdriver
from selenium.webdriver.common.by import By

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get("http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1")

links = [elem.get_attribute("href") for elem in driver.find_elements(By.TAG_NAME, 'a')]

print(links)

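Same problem. My question is: what am I missing that prevents me from getting all the links? This has been driving me crazy for two weeks. I also tried adding a delay, thinking the page might not have finished loading, but it still doesn't work. Thanks.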

1 Answer:

Answer 0 (score: 1)

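You could use a control total: extract the results count shown on the page and add the number of featured products to it. Both numbers are already available to you, so you can loop until the number of hrefs reaches that total. You will probably want to add a timeout to the loop.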

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')

prefs = {'profile.managed_default_content_settings.images':2}
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options)
driver.get("http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1")
wait = WebDriverWait(driver,20)
nonFeaturedTotal = int(wait.until(EC.presence_of_element_located((By.CSS_SELECTOR , '.ess-view-item-count-text'))).text.split(' ')[-1])
featuredTotal = len(wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ess-product-container-featured"))))
expectedTotal = featuredTotal + nonFeaturedTotal

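# Poll until the number of product links matches the expected total.
# WebDriverWait raises TimeoutException after 20 seconds if it never does.
wait.until(lambda d: len(d.find_elements(By.CSS_SELECTOR, ".ess-product-brand + [href]")) == expectedTotal)

links = driver.find_elements(By.CSS_SELECTOR, ".ess-product-brand + [href]")
results = [link.get_attribute("href") for link in links]

print(len(results))
print(results)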

driver.quit()
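
The "loop until the count matches, with a timeout" idea can also be sketched on its own, independent of Selenium. The helper name wait_for_count and its parameters are hypothetical, not part of any library, and the fake rendered sequence only stands in for a page that adds links over time. Unlike the strict equality check above, this sketch stops once the count is at least the expected total, which is an assumption on my part.

```python
import time


def wait_for_count(get_count, expected, timeout=20.0, interval=0.5):
    """Poll get_count() until it returns at least `expected` or `timeout` seconds pass.

    Returns the last observed count, so the caller can tell whether the
    target was actually reached. (Hypothetical helper, for illustration.)
    """
    deadline = time.monotonic() + timeout
    count = get_count()
    while count < expected and time.monotonic() < deadline:
        time.sleep(interval)
        count = get_count()
    return count


# Stand-in for a page that renders more links on each poll (illustration only).
rendered = iter([3, 7, 12, 12])
final = wait_for_count(lambda: next(rendered), expected=12, timeout=5.0, interval=0.01)
print(final)
```

With a real driver, get_count would be something like lambda: len(driver.find_elements(By.CSS_SELECTOR, ".ess-product-brand + [href]")), and a returned count below expectedTotal would mean the timeout was hit before every product rendered.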