Question

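I want to scrape a web page, and I need to find out whether an element's style is display: none; or display: block;, as in the code below. (If I view the page source, I cannot see this style anywhere. I only know about it because I used Inspect Element in Chrome.)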

<p id="add_to_cart" class="buttons_bottom_block no-print" style="display: none;">
                                <button type="submit" name="Submit" class="exclusive">
                                    <span>¡Cómprame!</span>
                                </button>
                            </p>


                            <p id="add_to_cart" class="buttons_bottom_block no-print" style="display: block;">
                                <button type="submit" name="Submit" class="exclusive">
                                    <span>¡Cómprame!</span>
                                </button>
                            </p>

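This is about a PrestaShop online store. Please watch the following video https://youtu.be/wlngNaNw1Ao and you will see the div oosHook switch its style between display: block and display: none, but you cannot see this in the page source. Please check the link https://www.esenciadeperfume.com/bvlgari/bvlgari-man-in-black-edp.html#/6-formato-100_ml_tester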

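Select one product option and then another and you will see the change, but if you analyze the source code it looks the same for every option. I wrote the following Python code for testing, and it fails to detect the change: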

import urllib.request
import re
import pymysql
from bs4 import BeautifulSoup

#link1='https://www.esenciadeperfume.com/bvlgari/bvlgari-man-in-black-edp.html#/6-formato-100_ml_tester'
link1="my reputation doesn't allow"
req = urllib.request.Request(link1, headers={'User-Agent': 'Mozilla/5.0'})        
htmltext = urllib.request.urlopen(req).read()
if htmltext is None:
    print('erro')            
else:
    matches=re.findall('<div id="oosHook" style="display: block;">',str(htmltext))        
    if len(matches)==0:
        print('Not found')
    else:
        print('Found')

OK, the following code seems to do the job:

import urllib.request
import re
import pymysql
from bs4 import BeautifulSoup
from selenium import webdriver
link1='https://www.esenciadeperfume.com/bvlgari/bvlgari-man-in-black-edp.html#/6-formato-100_ml_tester'
#link1='https://www.esenciadeperfume.com/bvlgari/bvlgari-man-in-black-edp.html#/20-formato-60_ml'
browser = webdriver.Firefox()  # Your browser will open, Python might ask for permission
browser.get(link1)               # This might take a while
soup = BeautifulSoup(browser.page_source,'html.parser')
cart_style = soup.find('p', id='add_to_cart').get('style')
oos_style = soup.find('div', id='oosHook').get('style')
print('Oos_style-> '+oos_style)

Problem: the process is slow.

Answer 1

I assume you know how to make a request and get the page source in Python.

If you use BeautifulSoup, you can search for the elements and read their tags and attributes from there. You could have something like this:

from bs4 import BeautifulSoup as bs

soup = bs(source_code, 'html.parser')  # source_code holds the page HTML as a string
elements = soup.find_all('p')

for e in elements:
    style = e.get('style', '').split(';')  # Account for multiple entries; '' if there is no style attribute
    for s in style:
        if 'display' in s:
            print(s.split(':')[1].strip())  # Prints 'none', 'block' or any other display value

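You can also handle these styles in several different ways. I kept it this way for clarity, but you could take a more direct approach or use re to process the style directly.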
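As one possible version of the "use re" approach mentioned above, here is a minimal sketch. The names DISPLAY_RE and display_value are made up for this illustration, and it only looks at the inline style attribute, not at rules coming from stylesheets.

```python
import re

# Matches a "display" declaration at the start of the style string or after a ';'.
DISPLAY_RE = re.compile(r"(?:^|;)\s*display\s*:\s*([^;]+)", re.IGNORECASE)


def display_value(style):
    """Return the display value from an inline style string, or None if absent."""
    if not style:
        return None  # the element has no style attribute (or an empty one)
    match = DISPLAY_RE.search(style)
    return match.group(1).strip().lower() if match else None
```

With BeautifulSoup, this could be called as display_value(e.get('style')) for each element e found by find_all.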

Edit

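OK, you are trying to scrape a dynamic web page, which is a bit different. You need a real browser session, and you have to wait until the page's scripts have applied all the changes they need to make.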

I tried it here with the selenium package and got the page successfully. Instead of a simple request, try the following:

from selenium import webdriver
from selenium.webdriver.common.by import By

# There are actually several driver options here; choose the one you like most
# (the browser needs to be installed on your PC).
browser = webdriver.Firefox()  # Your browser will open; Python might ask for permission
browser.get(url)               # This might take a while

# And then you can keep working from here.
# find_element(By.ID, ...) replaces find_element_by_id, which recent Selenium 4 releases removed.
cart_style = browser.find_element(By.ID, 'add_to_cart').get_attribute('style')
oos_style = browser.find_element(By.ID, 'oosHook').get_attribute('style')
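Since the style is applied by scripts after the page loads, it may help to wait for the element explicitly and read the computed display value rather than the raw attribute. The following is only a sketch under that assumption: is_hidden and wait_for_display are hypothetical helper names, the 10-second timeout is arbitrary, and the site's scripts may still change the value after the element first appears.

```python
def is_hidden(display):
    """Return True when a computed display value means the element is not rendered."""
    return display.strip().lower() == "none"


def wait_for_display(browser, element_id, timeout=10):
    """Wait until the element exists, then return its computed CSS display value."""
    # Imported here so the helper above stays usable without selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    def read_display(driver):
        # find_element raises NoSuchElementException until the element exists;
        # WebDriverWait ignores that exception by default and polls again.
        element = driver.find_element(By.ID, element_id)
        return element.value_of_css_property("display")

    # Raises TimeoutException if the element never shows up within the timeout.
    return WebDriverWait(browser, timeout).until(read_display)
```

With an open browser session, is_hidden(wait_for_display(browser, 'oosHook')) would then tell you whether the out-of-stock block is currently hidden.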

As suggested by @PadraicCunningham, you can get faster results with the PhantomJS driver. Just call:

browser = webdriver.PhantomJS(path_to_phantom)

Note: if PhantomJS is not in your $PATH, you need to provide the path where it is located.
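Newer Selenium releases no longer ship a PhantomJS driver, so a headless Firefox is one possible substitute. The sketch below also reuses a single browser session for several URLs, which avoids paying the browser start-up cost on every page. make_headless_firefox and collect_styles are hypothetical names for this illustration, and it assumes Firefox and geckodriver are installed.

```python
def make_headless_firefox():
    """Start Firefox without a visible window (needs Firefox and geckodriver)."""
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    return webdriver.Firefox(options=options)


def collect_styles(browser, urls, element_id="oosHook"):
    """Load each URL in the same browser session and read one element's inline style."""
    styles = {}
    for url in urls:
        browser.get(url)
        # "id" is the string value of selenium's By.ID locator strategy.
        element = browser.find_element("id", element_id)
        styles[url] = element.get_attribute("style")
    return styles
```

A typical run would create the browser once with make_headless_firefox(), pass it to collect_styles together with the list of product URLs, and call browser.quit() at the end.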

Python: scraping the style display: none
