Question

我正试图从我的uni网站上抓取一些数据，我正在使用请求和lxml |那个html。我曾经使用过beautifulsoup4，但它的使用速度不够快

这是我第一次使用lxml，我收到了这个错误：

from lxml import html
import requests
import json
import logging

url = 'https://example.com/' 
url_ajax = "https://example.com//webapps/portal/execute/tabs/tabAction"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With': 'XMLHttpRequest'
}  
#data of url link
    'payload = {
    'user_id': 'myid',
    'password': 'mypass'
}

#data of cources (ajax call) 

course_data = {
    'action' : 'refreshAjaxModule',
    'modId' : '_27_1',
    'tabId' : '_1_1' , 
    'tab_tab_group_id' : '_1_1' 
}
# make sure that links are working fine
# Enabling debugging at http.client level (requests->urllib3->http.client)
# you will see the REQUEST, including HEADERS and DATA, and RESPONSE with HEADERS but without DATA.
# the only thing missing will be the response.body which is not logged.
"""try: # for Python 3
    from http.client import HTTPConnection
except ImportError:
    from httplib import HTTPConnection
HTTPConnection.debuglevel = 1
logging.basicConfig() 
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
"""
# start the script 
session = requests.Session()
#go to the root url and post the username and password 
session.post(url ,headers=headers,data=payload ) 
# get the data of cources 
urlajax = session.post(url_ajax , headers=headers, data= course_data)   #get the ajax call

page = requests.get(urlajax)
page.json()   # This *call* raises an exception if JSON decoding fails
# here is my error 
content = page.content
tree = html.fromstring(content)
ga = tree.xpath('//div[@id="div_27_1"]//div[@id="_27_1termCourses__8_1"]/ul/li[1]/a/text()')
print(ga)

这是我的错误：

File "scrape.py", line 56, in <module>
     page = requests.get(urlajax)
 File "C:\Users\HozRifai\Desktop\WEBSCR~1\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
File "C:\Users\HozRifai\Desktop\WEBSCR~1\lib\site-packages\requests\api.py", line 58, in request
        return session.request(method=method, url=url, **kwargs)
File "C:\Users\HozRifai\Desktop\WEBSCR~1\lib\site-packages\requests\sessions.py", line 494, in request
        prep = self.prepare_request(req)
File "C:\Users\HozRifai\Desktop\WEBSCR~1\lib\site-packages\requests\sessions.py", line 437, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
File "C:\Users\HozRifai\Desktop\WEBSCR~1\lib\site-packages\requests\models.py", line 305, in prepare
        self.prepare_url(url, params)
File "C:\Users\HozRifai\Desktop\WEBSCR~1\lib\site-packages\requests\models.py", line 379, in prepare_url
        raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '<Response [200]>': No schema supplied. Perhaps you meant http://<Response [200]>?

Answer 1

检查此行：

url_ajax = "https://example.com//webapps/portal/execute/tabs/tabAction"

您确定要在根网址和/ webapps部分之间加入两个//吗？

Answer 2

错误的重要部分是最后一行：

requests.exceptions.MissingSchema：无效的网址''：未提供架构。也许你的意思是http://<Response [200]>

这一行正在发生

page = requests.get(urlajax)

我认为urlajax不是正确的类型，它会变成字符串"<Response 200>"。

我不知道你要做什么 - 如果你想要来自该响应的信息，你需要查看urlajax响应对象。对象本身不仅仅包含返回的有效负载。

request.get（）不起作用

2 个答案: