Unable to get rid of a timeout exception error

Posted: 2018-06-11 21:00:28

Tags: python python-3.x selenium selenium-webdriver web-scraping

When I run my script below, it successfully scrapes the first link and grabs the title and description, but when it tries to do the same with the next link I get a stale element reference exception on this line: data = [urljoin(link,item.get_attribute("href"))... How can I finish the job without this error?

Here is the script:

from urllib.parse import urljoin
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "http://urbantoronto.ca/database/"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)

for items in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#project_list table tr[id^='project']"))):
    data = [urljoin(link,item.get_attribute("href")) for item in items.find_elements_by_css_selector("a[href^='//urbantoronto']")]

    # I get the "stale element reference" error exactly here, pointing at the line above

    for nlink in data:
        driver.get(nlink)
        sitem = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title")))
        title = sitem.text
        try:
            desc = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".project-description p"))).text
        except Exception: desc = ""
        print("Title: {}\nDescription: {}\n".format(title,desc))

driver.quit()

1 Answer:

Answer 0 (score: 1)

The real problem is your outer loop. The items you are iterating over go stale as soon as you change pages, i.e. the moment you call driver.get(nlink). That is why you get a StaleElementReferenceException on the second pass through the items.find_elements loop. The reason it then times out on sitem is that an element only becomes stale when the DOM changes; if the DOM has not changed, you may end up waiting on a stale element instead.
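To make that concrete, here is a minimal sketch of the failing pattern versus the safe one (a variable name like next_page is illustrative, not from the original script). A WebElement is a live reference into the current DOM, so the fix is to copy the href strings out before the first navigation:

# Failing pattern: the elements reference a DOM that driver.get() throws away
rows = driver.find_elements_by_css_selector("#project_list table tr[id^='project']")
driver.get(next_page)                        # the old DOM is discarded here
rows[0].find_elements_by_css_selector("a")   # raises StaleElementReferenceException

# Safe pattern: plain strings cannot go stale
hrefs = [a.get_attribute("href") for a in
         driver.find_elements_by_css_selector("a[href^='//urbantoronto']")]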

With that in mind, I would suggest a slightly different approach using BeautifulSoup. Selenium is great for JavaScript execution, but it is a bit slow at parsing HTML, which is what you are doing for all of those table rows. So I suggest the following changes:


EDIT: here is the full script (it still uses BeautifulSoup for the HTML parsing; a pure-Selenium variant is sketched at the end):

from urllib.parse import urljoin
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import re
from bs4 import BeautifulSoup as bs

link = "http://urbantoronto.ca/database/"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)

# For readability
by_selector = (By.CSS_SELECTOR, "#project_list table tr[id^='project']")
wait.until(EC.presence_of_all_elements_located(by_selector))

# Get HTML content
soup = bs(driver.page_source, 'lxml')

# Find div containing project table
table = soup.find('div', {'id': 'project_list'})

# Find all the project rows
projects = table.find_all('tr', {'id': re.compile(r'^project\d+')})

# Create absolute page links (the hrefs are protocol-relative, e.g. //urbantoronto...)
links = ['http:' + x.find('a')['href'] for x in projects]

for nlink in links:

    driver.get(nlink)
    sitem = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title")))
    title = sitem.text
    try:
        desc = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".project-description p"))).text
    except Exception:
        desc = ""
    print("Title: {}\nDescription: {}\n".format(title,desc))

driver.quit()

To be clear, you need to extract the URLs before the loop in order to avoid the stale element problem you ran into.
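For completeness, a pure-Selenium version of that idea might look like the sketch below (same selectors as the script above; treat it as an outline rather than a tested drop-in):

from urllib.parse import urljoin
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "http://urbantoronto.ca/database/"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)

# Copy every href into a plain string BEFORE navigating anywhere,
# so nothing can go stale.
wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, "#project_list table tr[id^='project']")))
anchors = driver.find_elements_by_css_selector(
    "#project_list table tr[id^='project'] a[href^='//urbantoronto']")
links = [urljoin(link, a.get_attribute("href")) for a in anchors]

for nlink in links:
    driver.get(nlink)
    title = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "h1.title"))).text
    try:
        desc = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, ".project-description p"))).text
    except Exception:
        desc = ""
    print("Title: {}\nDescription: {}\n".format(title, desc))

driver.quit()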