Question

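I want to use a headless browser to scrape some websites, and I need to go through a proxy server.

I'm lost and looking for help.

When I disable the proxy, it works every time.

When I disable headless mode, I get a blank browser window, but if I press Enter in the URL bar, which contains "https://www.whatsmyip.org", the page loads (through the proxy server, showing a different IP).

Other websites fail the same way; whatsmyip.org is not the only one that produces this result.

I'm running CentOS 7, Python 3.6, and Selenium 3.14.0.

I also tried this on a Windows machine running Anaconda, with the same result.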

from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

my_proxy = "x.x.x.x:xxxx"  # I have a real proxy address here
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': my_proxy,
    'ftpProxy': my_proxy,
    'sslProxy': my_proxy,
    'noProxy': ''
})

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--allow-insecure-localhost')
chrome_options.add_argument('--allow-running-insecure-content')
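# NOTE: the first and third switches below look like PhantomJS options; Chrome may ignore them.
chrome_options.add_argument('--ignore-ssl-errors')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--ssl-protocol=any')
# Chrome expects WIDTH,HEIGHT here, separated by a comma.
chrome_options.add_argument('--window-size=800,600')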
chrome_options.add_argument('--disable-application-cache')

capabilities = dict(DesiredCapabilities.CHROME)
proxy.add_to_capabilities(capabilities)
capabilities['acceptSslCerts'] = True
capabilities['acceptInsecureCerts'] = True

browser = webdriver.Chrome(executable_path=r'/home/glen/chromedriver', chrome_options=chrome_options, desired_capabilities=capabilities)

browser.get('https://www.whatsmyip.org/')

print(browser.page_source)

# quit() ends the whole session and stops chromedriver; close() only closes the window.
browser.quit()

When I run the code, it returns the following:

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>

Not the website.

Answer 1

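There are two problems here:

1. You need to wait for the browser to load the site.
2. browser.page_source may not return what you want.

The first problem is solved by waiting for an element to appear in the DOM. Usually you want to scrape something specific, so you already know how to identify an element. Add code that waits for that element to exist.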

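The second problem is that page_source is not guaranteed to return the current DOM; it can return the HTML the browser initially loaded. If JavaScript has modified the page since then, you may not see those changes this way.

The solution is to find the html element and ask for its outerHTML attribute: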

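from selenium.webdriver.common.by import By

# `browser` is the webdriver.Chrome instance created in the question.
html_element = browser.find_element(By.TAG_NAME, "html")
# The Python binding uses get_attribute(); the property name is case-sensitive ("outerHTML").
dom = html_element.get_attribute("outerHTML")
print(dom)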

For details, see this example: https://www.seleniumhq.org/docs/03_webdriver.jsp#introducing-the-selenium-webdriver-api-by-example

Headless Chrome returns empty HTML when using a proxy server
