Selenium WebDriver cannot get some of the content

Time: 2018-01-29 21:38:52

Tags: python selenium-webdriver

https://www.forrent.com/apartment-community-profile/1012635

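I am trying to parse a web page, for example the one above. Selenium returns part of this page's content, but not all of it. For example, "Professional management: B& staff" appears on the page, but it is not in the variable `content` returned by the script. Any idea why this happens and how to fix it?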

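from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait


def get_content(url):  # hypothetical wrapper name; the snippet is the body of a function taking url
    driver = webdriver.Firefox(executable_path='/home/yliu/repos/funnel_objects/listing_sites/geckodriver')
    try:
        driver.set_page_load_timeout(20)
        driver.get(url)

        #WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "contactHeading")))
        WebDriverWait(driver, 40)
        html = driver.page_source
        content = BeautifulSoup(html, "lxml")
        driver.quit()
        return content
    except TimeoutException:
        print('time out from contact')
        return None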
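
A side note on the snippet above: `WebDriverWait(driver, 40)` on its own only creates a wait object. Nothing blocks until `.until(...)` is called on it, for example `WebDriverWait(driver, 40).until(lambda d: text in d.page_source)`. The following is a minimal sketch of the same polling idea in plain Python, assuming only that `driver` exposes a `page_source` attribute. The helper name `wait_for_text` and its parameters are hypothetical, not part of Selenium.

```python
import time


def wait_for_text(driver, text, timeout=20.0, poll=0.5):
    """Poll driver.page_source until `text` appears or `timeout` seconds pass.

    Hypothetical helper: `driver` is any object with a `page_source`
    attribute, such as a Selenium WebDriver. Returns True if the text
    was found, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if text in driver.page_source:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Waiting alone may not be enough if the missing content is only loaded after the page is scrolled, so a `False` result here is a hint to look at how the page loads that section.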

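1 answer:

Answer 0: (score: 2)

That content is a lazy-loaded component. It only appears after the page has been scrolled, so you need a script that scrolls down to the bottom. See the code below.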

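import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

SCROLL_PAUSE_TIME = 0.5  # seconds to pause after each scroll step
SCROLL_LENGTH = 200      # pixels scrolled per step


def get_content(url):  # hypothetical wrapper name; the snippet is the body of a function taking url
    driver = webdriver.Firefox(executable_path='/home/yliu/repos/funnel_objects/listing_sites/geckodriver')
    try:
        driver.set_page_load_timeout(20)
        driver.get(url)

        #WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "contactHeading")))
        #WebDriverWait(driver, 40)
        # Scroll down in small steps so the lazy-loaded components are triggered.
        page_height = int(driver.execute_script("return document.body.scrollHeight"))
        scroll_position = 0
        while scroll_position < page_height:
            scroll_position += SCROLL_LENGTH
            driver.execute_script("window.scrollTo(0, arguments[0]);", scroll_position)
            time.sleep(SCROLL_PAUSE_TIME)

        html = driver.page_source
        content = BeautifulSoup(html, "lxml")
        return content
    except TimeoutException:
        print('time out from contact')
        return None
    finally:
        driver.quit()  # close the browser on success and on timeout alike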
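
One possible refinement of the scrolling loop above: the page height is read only once, but lazy-loaded components can make the page taller as they appear. The sketch below re-measures the height after every step and caps the number of steps so that an endlessly growing page cannot loop forever. The helper name `scroll_to_bottom` and the `max_steps` parameter are assumptions made for illustration. Only `execute_script` is an actual WebDriver method.

```python
import time


def scroll_to_bottom(driver, step=200, pause=0.5, max_steps=500):
    """Scroll down in `step`-pixel increments until the bottom is reached.

    Hypothetical helper: `driver` is any object exposing Selenium's
    `execute_script(script, *args)` method. Returns the number of
    scroll steps performed.
    """
    position = 0
    steps = 0
    height = int(driver.execute_script("return document.body.scrollHeight"))
    while position < height and steps < max_steps:
        position += step
        driver.execute_script("window.scrollTo(0, arguments[0]);", position)
        time.sleep(pause)
        # Re-measure: lazy-loaded components can make the page taller.
        height = int(driver.execute_script("return document.body.scrollHeight"))
        steps += 1
    return steps
```

In the answer's code this would be called as `scroll_to_bottom(driver)` right after `driver.get(url)` and before reading `driver.page_source`.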