Reading XHR responses with Selenium

Date: 2018-05-21 14:05:37

Tags: python selenium python-requests sniffing browsermob-proxy

I am trying to scrape Instagram with Selenium using the Chrome webdriver. I need the XHR response data, and I have tried browsermob-proxy, but the information it gives me is not enough:

from browsermobproxy import Server
from selenium import webdriver
import time

# Start the BrowserMob Proxy server and create a proxy for the browser to use.
server = Server("/home/doruk/Downloads/browsermob-proxy 2.1.4/bin/browsermob-proxy")
server.start()
time.sleep(1)
proxy = server.create_proxy()
time.sleep(1)

# Route Chrome's traffic through the proxy so its requests can be recorded.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--proxy-server={0}".format(proxy.proxy))
browser = webdriver.Chrome(chrome_options=chrome_options)
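
The question does not show how the HAR recording was started; a minimal sketch of the usual sequence is below (the target page, the "instagram" reference name, and the captureContent option are my assumptions, not part of the original post). Without captureContent the HAR records request metadata only, which is likely why the output below contains no response bodies:

# Assumed, for illustration: start a fresh HAR recording before navigating.
# 'captureContent' asks the proxy to store response bodies in the HAR.
proxy.new_har("instagram", options={"captureHeaders": True, "captureContent": True})
browser.get("https://www.instagram.com/")  # assumed target page
time.sleep(5)  # crude wait; see the readiness check at the end of the answer
har = proxy.har  # HAR dict; the recorded requests live under har["log"]["entries"]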

##############################################
#### This is the output of proxy.har in JSON format (excerpt):
 {
    "comment": "", 
    "serverIPAddress": "155.245.9.55", 
    "pageref": "", 
    "startedDateTime": "2018-05-21T16:44:41.053+03:00", 
    "cache": {}, 
    "request": {
      "comment": "", 
      "cookies": [], 
      "url": "https://scontent-sof1-1.cdninstagram.com/vp/e95312434013bc43a5c00c458b53022cb/5BC46751/t51.2885-19/s150x150/26432586_139925760144086_726193654523232256_n.jpg", 
      "queryString": [], 
      "headers": [], 
      "headersSize": 528, 
      "bodySize": 0, 
      "method": "GET", 
      "httpVersion": "HTTP/1.1"
    }, 

When I click "Load more comments" on a post, a request like the following shows up:

  

https://www.instagram.com/graphql/query/?query_hash=33ba35000cb50da46f5b5e889df7d159&variables=%7B%22shortcode%22%3A%22Bi9ZURdA6Gn%22%2C%22first%22%3A36%2C%22after%22%3A%22AQBr-wP7U4Ykr1QRH7PYJ1a0KQivhS0Ndwae-5F8vrZ5sf1eA_Bfgn4dZ0ql0pwUf9GXPm_LPyhtCnlhH6YOHfuNstwXK9VZuUIR4zD3k24s6Q%22%7D

and it carries the information I need. Is there a way to handle this situation?

I only need the requests that contain "?query_hash=".
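
As an aside, once response bodies are being captured (see the sketch above), filtering the HAR for those GraphQL calls could look roughly like this; the entry fields follow the standard HAR layout, and the JSON parsing step is only an assumption about what one might do with the payload:

import json

# Scan the recorded entries and keep only the "?query_hash=" XHR calls.
for entry in proxy.har["log"]["entries"]:
    url = entry["request"]["url"]
    if "?query_hash=" in url:
        # Response bodies are present only if captureContent was enabled.
        body = entry["response"]["content"].get("text", "")
        if body:
            data = json.loads(body)  # Instagram returns JSON for these calls
            print(url)
            print(list(data.keys()))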

(Screenshot: example view)

1 Answer:

Answer 0 (score: 0):

I managed to do it! The trick was to wait for the page to finish loading completely. In my case the page keeps loading even after the DOM reaches its ready state. There is a way to get rid of the arbitrary sleep and ask the driver whether the page has really finished loading; I don't remember the code, I would have to search for it.

from browsermobproxy import Server
import json
from selenium import webdriver
import time

urle = "https://www.yoururl.com"

# Start BrowserMob Proxy and route a Firefox profile through it.
server = Server(path="./browsermob-proxy-2.1.4/bin/browsermob-proxy")
server.start()
proxy = server.create_proxy()
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile, executable_path='./geckodriver')

# Record a HAR with headers and response bodies, then load the page.
proxy.new_har(urle, options={'captureHeaders': True, 'captureContent': True})
driver.get(urle)
time.sleep(10)  # crude wait for the page and its XHR calls to finish

# Dump the captured traffic as JSON.
result = json.dumps(proxy.har, ensure_ascii=False)
print(result)

proxy.stop()
driver.quit()
server.stop()
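
The answer mentions a way to drop the arbitrary sleep but does not show it; a common approach, offered here only as a sketch, is to poll document.readyState through execute_script until the browser reports the load as complete:

from selenium.webdriver.support.ui import WebDriverWait

# Replace time.sleep(10) with a poll of the browser's readyState.
# Note this only covers the initial document load; XHR calls fired afterwards
# may still need an explicit wait for a specific element or request.
WebDriverWait(driver, 30).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)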