Question

I am trying to scrape data from a website, but I only want to open the first 1000 links and get the data from each of those pages.

I tried:

list_links = driver.find_elements_by_tag_name('a')

for i in list_links:
    print(i.get_attribute('href'))

This returns extra links that I do not need.

For example: https://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1,2,3,4,5,%3E5&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa,Residential-Plot&cityName=Mumbai

On that page we get more than 50,000 links. How can I open only the first 1000 links, the ones listed below with property photos?

Edit

I also tried this:

driver.find_elements_by_xpath("//div[@class='.l-srp__results.flex__item']")
driver.find_element_by_css_selector('a').get_attribute('href')

for matches in driver:
    print('Liking')
    print (matches)
    #matches.click()
    time.sleep(5)

But I get an error: TypeError: 'WebDriver' object is not iterable

Why can't I get the links with this line: driver.find_element_by_css_selector('a').get_attribute('href')

Edit 1

I am trying to sort out the links as follows, but I get an error:

            result = re.findall(r'https://www.magicbricks.com/propertyDetails/', my_list)
            print (result)

Error: TypeError: expected string or bytes-like object

I also tried:

            a = ['https://www.magicbricks.com/propertyDetails/']
            output_names = [name for name in a if (name[:45] in my_list)]
            print (output_names)

This printed nothing.

All of the links are in the list. Please advise.

Thanks in advance.

Answer 1

Selenium is not a good choice for web scraping. I would suggest JMeter, which is free and open source.

http://www.testautomationguru.com/jmeter-how-to-do-web-scraping/

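If you do want to use Selenium, the approach you are trying to follow (click, then grab the data) is not a stable one. Instead, I suggest something like the following. The example is in Java, but you should be able to follow it.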

driver.get("https://www.yahoo.com");

Map<Integer, List<String>> map = driver.findElements(By.xpath("//*[@href]")) 
                .stream()                             // find all elements which has href attribute & process one by one
                .map(ele -> ele.getAttribute("href")) // get the value of href
                .map(String::trim)                    // trim the text
                .distinct()                           // there could be duplicate links , so find unique
                .collect(Collectors.groupingBy(LinkUtil::getResponseCode)); // group the links based on the response code

More information here:

http://www.testautomationguru.com/selenium-webdriver-how-to-find-broken-links-on-a-page/
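Since the question is about Python, here is a rough Python sketch of the same idea: read every href, trim it, drop duplicates, then group the links. It is only an illustration. The helper names unique_hrefs, group_by and hrefs_on_page are made up for this example, and the grouping key is left to the caller because LinkUtil.getResponseCode in the Java snippet comes from the linked article and is not reproduced here. The Selenium call assumes the Selenium 4 find_elements(By.XPATH, ...) form.

```python
from collections import defaultdict


def unique_hrefs(raw_hrefs):
    """Trim each href and drop blanks and duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for href in raw_hrefs:
        if href is None:  # get_attribute returns None when the attribute is missing
            continue
        href = href.strip()
        if href and href not in seen:
            seen.add(href)
            result.append(href)
    return result


def group_by(links, key_func):
    """Group links by whatever key_func returns (hypothetical helper)."""
    groups = defaultdict(list)
    for link in links:
        groups[key_func(link)].append(link)
    return dict(groups)


def hrefs_on_page(driver):
    """Collect the unique hrefs of the page currently loaded in driver."""
    # Imported here so the pure helpers above can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    elements = driver.find_elements(By.XPATH, "//*[@href]")
    return unique_hrefs(el.get_attribute("href") for el in elements)
```

For instance, group_by(links, lambda url: url.split(":")[0]) would group by URL scheme. Grouping by HTTP status, as the Java version does, would need a key function that actually requests each link.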

Answer 2

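I believe you should collect into a list all elements that have the tag name "a" and an "href" attribute that is not null.
Then iterate over the list and click the elements one by one.
Create a list of type WebElement and store all the valid links in it.
You can apply further filters or conditions at this point, for example that the link contains certain characters.

To store the WebElements in a list, you can use driver.findElements(). This method returns a list of WebElement.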
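In Python, the filtering step described above might look roughly like the sketch below. It is a sketch under assumptions, not a tested scraper for this site: it assumes the wanted links start with the https://www.magicbricks.com/propertyDetails/ prefix used in the question, the function names are invented for illustration, and it uses the Selenium 4 find_elements(By.TAG_NAME, ...) call, since newer Selenium 4 releases no longer ship the find_elements_by_* helpers. It also keeps the href strings rather than the WebElements, which is a deliberate change from clicking each element: element references go stale once the browser navigates away from the page.

```python
from itertools import islice

# Prefix taken from the question; adjust it if the site uses a different pattern.
DETAIL_PREFIX = "https://www.magicbricks.com/propertyDetails/"


def first_detail_links(hrefs, prefix=DETAIL_PREFIX, limit=1000):
    """Return at most `limit` hrefs that start with `prefix`, in page order."""
    # Test each href on its own; re.findall() expects one string, not a list,
    # which is why the attempt in the question raised a TypeError.
    matches = (h for h in hrefs if isinstance(h, str) and h.startswith(prefix))
    return list(islice(matches, limit))


def scrape_detail_links(driver, limit=1000):
    """Read every <a> href on the current page and keep the first matches."""
    # Imported here so first_detail_links() can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    # find_elements() returns a list; iterate over that list, not over the driver.
    anchors = driver.find_elements(By.TAG_NAME, "a")
    hrefs = [a.get_attribute("href") for a in anchors]
    return first_detail_links(hrefs, limit=limit)
```

Each returned URL can then be opened with driver.get(url) inside a loop to read the data from that page.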

How to click step by step to get data from a website using Selenium with Python
