I am trying to scrape a list of 100 universities from this website (topuniversities.com).
Using =IMPORTXML("https://www.topuniversities.com/university-rankings/usa-rankings/2021","//*[@id='ranking-data-load']/div[1]/div/div/div/div[2]")
shows the error: Imported content is empty.
How can I use XPath to get the data I need?
Answer 0 (score: 2)
I found this XHR request in the developer tools:
https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3738856.txt?1622189434?v=1622361479157
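Your XPath will not work unless the JavaScript is rendered, and IMPORTXML does not execute JavaScript, so the element you are selecting is still empty when it is read.
To get the data, you have two options:
1. Selenium driving a real browser (requires a webdriver for Chrome, Firefox, etc.)
2. Collect the appropriate headers and data, then send the request yourself with the requests module
Here is the code for the second option: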
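import re
import requests

# XHR endpoint copied from the browser developer tools (Network tab)
URL = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3738856.txt?1622189434?v=1622361479157'
# Headers copied from the same browser request
headers = {
    "Host": "www.topuniversities.com",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux armv8l; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://www.topuniversities.com/university-rankings/usa-rankings/2021",
    "X-Requested-With": "XMLHttpRequest",
    "via": "1.1 google"
}
response = requests.get(URL, headers=headers)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
datas = response.json()

# Each entry's 'title' is an HTML snippet; pull the link text out of it.
# The non-greedy (.*?) stops at the first closing </a>.
for i in datas['data']:
    for j in re.findall(r'class="uni-link">(.*?)</a>', i['title']):
        print(j)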
Result:
Harvard University
Stanford University
Massachusetts Institute of Technology (MIT)
University of California, Berkeley (UCB)
University of California, Los Angeles (UCLA)
Yale University
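
The regex in the code relies on class="uni-link" being the last attribute before the closing angle bracket of the tag. If that markup ever changes, a parser-based version may hold up better. Here is a minimal sketch using only the standard library's html.parser. The names UniLinkParser and extract_names are my own, and the sample records are hypothetical, shaped the way the regex implies the 'title' field looks, not copied from the site's real response.

```python
from html.parser import HTMLParser


class UniLinkParser(HTMLParser):
    """Collect the text inside <a> tags whose class list contains 'uni-link'."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "uni-link" in classes:
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.names.append("".join(self._buffer).strip())
            self._in_link = False


def extract_names(records):
    """Return the names found in each record's 'title' HTML snippet."""
    names = []
    for record in records:
        parser = UniLinkParser()
        parser.feed(record["title"])
        parser.close()
        names.extend(parser.names)
    return names


# Hypothetical sample shaped like the JSON 'data' list (an assumption, not real output).
sample = [
    {"title": '<div class="td-wrap"><a href="/universities/harvard-university" class="uni-link">Harvard University</a></div>'},
    {"title": '<div class="td-wrap"><a class="uni-link" href="/universities/stanford-university">Stanford University</a></div>'},
]
print(extract_names(sample))
```

With the real response you would call extract_names(datas['data']) instead of passing the sample. Note that the second sample record puts the class attribute before href, a case the regex would miss.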