IMPORTXML shows an error when scraping data from a website

Date: 2021-05-30 06:50:48

Tags: web-scraping google-sheets google-sheets-formula

I am trying to scrape the list of the top 100 universities from this website (topuniversities.com).

I am using =IMPORTXML("https://www.topuniversities.com/university-rankings/usa-rankings/2021","//*[@id='ranking-data-load']/div[1]/div/div/div/div[2]")

It shows the error: Imported content is empty.

How can I use XPath to get the data I need?

1 answer:

Answer 0 (score: 2)

I found this XHR request in the browser's developer tools:

https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3738856.txt?1622189434?v=1622361479157

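Your XPath will not work unless the page's JavaScript is rendered, because the ranking rows are loaded by that request after the page itself has been delivered.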
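The point above can be illustrated offline with a minimal sketch. The markup below is a hypothetical stand-in for what the server sends before any script runs (it is not copied from the real page), and the path is a simplified form of the XPath from the question, since the standard library's ElementTree only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the HTML delivered before any script runs:
# the ranking container exists but has no children yet.
STATIC_HTML = """
<html>
  <body>
    <div id="ranking-data-load"></div>
    <script src="app.js"></script>
  </body>
</html>
""".strip()

# Simplified version of the XPath from the question (ElementTree supports
# only a subset of XPath, so the deeper positional steps are dropped).
PATH = ".//*[@id='ranking-data-load']/div"


def count_matches(markup: str, path: str) -> int:
    """Return how many elements the path selects in the given markup."""
    root = ET.fromstring(markup)
    return len(root.findall(path))


if __name__ == "__main__":
    # The container is empty in the static markup, so nothing is selected.
    print(count_matches(STATIC_HTML, PATH))
```

A fetcher that does not execute JavaScript only ever sees markup like this, so a path into the container selects nothing, which would be consistent with the "Imported content is empty" error.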

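To get the data, you have two options:

  • Selenium driving a real browser such as Chrome or Firefox (requires a WebDriver)

  • Collect the appropriate headers and data and send the request yourself with the requests module

Here is the code for the second option: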

import re

import requests

# XHR endpoint found in the browser's developer tools
URL = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3738856.txt?1622189434?v=1622361479157'

# Headers copied from the browser's request
headers = {
    "Host": "www.topuniversities.com",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux armv8l; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://www.topuniversities.com/university-rankings/usa-rankings/2021",
    "X-Requested-With": "XMLHttpRequest",
    "via": "1.1 google"
}

# Parse the response body as JSON
datas = requests.get(URL, headers=headers).json()

# Each entry's 'title' is an HTML fragment; the university name is the text
# of the anchor with class "uni-link". The non-greedy (.*?) stops at the
# first closing </a> instead of running on to the last one.
for i in datas['data']:
    for j in re.findall(r'class="uni-link">(.*?)</a>', i['title']):
        print(j)

Result:

Harvard University
Stanford University
Massachusetts Institute of Technology (MIT)
University of California, Berkeley (UCB)
University of California, Los Angeles (UCLA)
Yale University
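As an alternative to the regular expression, the name extraction step could be sketched with the standard library's HTML parser, which does not depend on the exact attribute order inside the tag. The sample payload below is hypothetical: it only mirrors the shape the code above relies on (a 'data' list whose entries have an HTML 'title' containing an anchor with class "uni-link"), and the names UniLinkParser and extract_names are made up for this illustration.

```python
from html.parser import HTMLParser


class UniLinkParser(HTMLParser):
    """Collect the text of every <a> element whose class includes 'uni-link'."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_link = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "uni-link" in classes:
            self._in_link = True
            self._buf = []

    def handle_data(self, data):
        if self._in_link:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.names.append("".join(self._buf).strip())
            self._in_link = False


def extract_names(payload: dict) -> list:
    """Return the university names found in payload['data'][*]['title']."""
    names = []
    for row in payload["data"]:
        parser = UniLinkParser()
        parser.feed(row["title"])
        parser.close()
        names.extend(parser.names)
    return names


# Hypothetical sample shaped like the JSON the code above reads (assumption).
SAMPLE = {
    "data": [
        {"title": '<div><a href="/universities/harvard-university" class="uni-link">Harvard University</a></div>'},
        {"title": '<div><a href="/universities/stanford-university" class="uni-link">Stanford University</a></div>'},
    ]
}

if __name__ == "__main__":
    for name in extract_names(SAMPLE):
        print(name)
```

With real data, the dictionary returned by `.json()` would be passed to `extract_names` in place of `SAMPLE`. The parser also decodes character references such as `&amp;` in the link text, which the regular expression leaves as-is.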