Python Scrapy / Selenium is skipping most of my iterations

Date: 2017-01-10 02:00:45

Tags: python selenium scrapy

I'm trying to scrape a retail clothing shopping site. For some reason, whenever I run the code below I only end up with a handful of items from three of the categories (defined as nth-children in parse()) and a range of items from li:nth-child(5).

Sometimes I get the following error:

2017-01-09 20:33:30 [scrapy] ERROR: Spider error processing <GET http://www.example.com/jackets> (referer: http://www.example.com/)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/BeardedMac/projects/thecurvyline-scraper/spiders/example.py", line 47, in parse_items
    price = node.find_element_by_css_selector('div.flex-wrapper--prod-details > div.pricing > div.price > div.standardprice').text
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 307, in find_element_by_css_selector
    return self.find_element(by=By.CSS_SELECTOR, value=css_selector)
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 511, in find_element
    {"using": by, "value": value})['value']
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 494, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 192, in check_response
    raise exception_class(message, screen, stacktrace)
StaleElementReferenceException: Message: The element reference is stale. Either the element is no longer attached to the DOM or the page has been refreshed

However, if I change the nth-child selector to li:nth-child(3), I get a lot of items from that category, but I still can't seem to get all of them at once.

I'm pretty new to Python and Scrapy, so I may just be missing something.

def __init__(self):
    self.driver = webdriver.Chrome('/MyPath/chromedriver')
    self.driver.set_page_load_timeout(10)

def parse(self, response):
    for href in response.css('#main-menu > div > li:nth-child(n+3):nth-child(-n+6) > a::attr(href)').extract():
        yield scrapy.Request(response.urljoin(href), callback=self.parse_items)

def get_item(self, response):
    sizes = response.css('#pdpMain > div.productdetailcolumn.productinfo > div > div.variationattributes > div.swatches.size > ul > li > a::text').extract()
    product_id = response.css('#riiratingsfavorites > div.riiratings > a::attr(rel)').extract_first()
    response.meta['product']['sizes'] = sizes
    response.meta['product']['product_id'] = product_id
    yield response.meta['product']


def parse_items(self, response):
    category = response.css('#shelf > div.category-header > h2::text').extract_first()
    self.driver.get(response.url)
    nodes = self.driver.find_elements_by_css_selector('#search > div.productresultarea > div.product.producttile')
    for node in nodes:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)
        price = node.find_element_by_css_selector('div.flex-wrapper--prod-details > div.pricing > div.price > div.standardprice').text 
        images = node.find_element_by_css_selector('div.image > div.thumbnail > p > a > img:nth-child(1)').get_attribute('src')
        name = node.find_element_by_css_selector('div.flex-wrapper--prod-details > div.name > a').text
        product_url = node.find_element_by_css_selector('div.flex-wrapper--prod-details > div.name > a').get_attribute('href')
        product = Product()
        product['title'] = name
        product['price'] = price
        product['product_url'] = product_url
        product['retailer'] = 'store7'
        product['categories'] = category
        product['images'] = images
        product['sizes'] = []
        product['product_id'] = []
        product['base_url'] = '' 
        product_page = response.urljoin(product_url)
        yield scrapy.Request(product_page, callback=self.get_item, meta={'product': product})

1 Answer:

Answer 0 (score: 0)

In short: what is happening here is that Scrapy is concurrent while your Selenium setup is not, so your Selenium driver gets confused. Throughout the crawl, Scrapy keeps asking the driver to load a new URL while it is still working with the old one.
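
Here is a minimal stand-alone sketch of that staleness (this assumes chromedriver at the same path as in your spider, and uses a placeholder URL):

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException

driver = webdriver.Chrome('/MyPath/chromedriver')   # one shared driver, like in your spider
driver.get('http://www.example.com/')               # "request A" navigates
links = driver.find_elements_by_css_selector('a')   # callback A now holds live DOM references
driver.get('http://www.example.com/')               # "request B" reuses the same driver; the old DOM is thrown away
try:
    print(links[0].text)                            # callback A's references now point at a dead DOM
except StaleElementReferenceException:
    print('stale: the page those elements belonged to is gone')
driver.quit()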

To avoid this, you can disable your spider's concurrency by setting CONCURRENT_REQUESTS to 1, e.g. add this to your settings.py file:

CONCURRENT_REQUESTS = 1
If you'd rather limit this setting to a single spider, add a custom_settings entry to it:

class MySpider(scrapy.Spider):
    custom_settings = {'CONCURRENT_REQUESTS': 1}

If you want to keep concurrency (which is a very good thing), you could try replacing Selenium with a more Scrapy-friendly technique such as Splash.
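
For example, a rough sketch with scrapy-splash (this assumes a Splash instance running on localhost:8050 and the scrapy-splash package installed; the settings lines follow the scrapy-splash README):

from scrapy_splash import SplashRequest

# settings.py additions (per the scrapy-splash documentation):
# SPLASH_URL = 'http://localhost:8050'
# DOWNLOADER_MIDDLEWARES = {
#     'scrapy_splash.SplashCookiesMiddleware': 723,
#     'scrapy_splash.SplashMiddleware': 725,
#     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
# }
# DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

def parse(self, response):
    for href in response.css('#main-menu > div > li:nth-child(n+3):nth-child(-n+6) > a::attr(href)').extract():
        # Splash renders the JavaScript for you, so parse_items can read prices and names
        # with response.css() instead of a shared Selenium driver, and the spider keeps
        # its concurrency.
        yield SplashRequest(response.urljoin(href), self.parse_items, args={'wait': 2})

With that in place the Selenium driver (and the StaleElementReferenceException) goes away entirely, since all rendering happens inside Splash.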