From a website

Time: 2021-04-12 12:28:26

Tags: python r web-scraping python-requests httr

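I want to scrape a table from the following website: https://www.katastar.hr

To see what I am after, open the browser's developer tools (Inspect), then click the Network tab. Now, when you open the site, you can see a request to this URL: https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined

The problem is that id and status are different every time the site is opened. How can I scrape the output of the request above (it is JSON that represents a table) when the GET query is different every time?

I would give a reproducible example, but I don't have anything specific to try. I assume I should start from the home page, but I don't know how to proceed: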

headers <- c(
  "Accept" = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding' = "gzip, deflate, br",
  'Accept-Language' = 'hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7',
  "Cache-Control" = "max-age=0",
  "Connection" = "keep-alive",
  "DNT" = "1",
  "Host" = "www.katastar.hr",
  "If-Modified-Since" = "Mon, 22 Mar 2021 13:39:38 GMT",
  "Referer" = "https://www.google.com/",
  "sec-ch-ua" = '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
  "sec-ch-ua-mobile" = "?0",
  "Sec-Fetch-Dest" = "document",
  "Sec-Fetch-Mode" = "navigate",
  "Sec-Fetch-Site" = "same-origin",
  "Sec-Fetch-User" = "?1",
  "Upgrade-Insecure-Requests" = "1",
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
)
p <- httr::GET(
  "https://www.katastar.hr/",
  httr::add_headers(.headers = headers))
httr::cookies(p)

The code can be in R or Python.

1 answer:

Answer 0: (score: 2)

You only need the HTTP header Origin to make it work:

  • Python
import requests

r = requests.get(
    "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
    headers={
        "Origin": "https://www.katastar.hr"
    })

print(r.json())

repl.it:https://replit.com/@bertrandmartel/ScrapeKatastar

  • R
library(httr)

data <- content(GET(
  "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
  add_headers(origin = "https://www.katastar.hr")
  ), as = "parsed", type = "application/json")

print(data)

To understand how the site generates id and status, here is the relevant code from its JS:

e.prototype.getSurveyors = function(e) {
    var t = this.runbase(),
      n = this.create(t.toString(), null);
    return this.httpClient.get(s + "/position", {
      params: {
        id: t.toString(),
        status: n,
        x: String(e[0]),
        y: String(e[1])
      }
    })
}
e.prototype.runbase = function() {
    return Math.floor(1e7 * Math.random())
}
e.prototype.create = function(e, t) {
    for (var n = 0, i = 0; i < e.length; i++) n = (n << 5) - n + e.charAt(i).charCodeAt(0), n &= n;
    return null == t && (t = e), Math.abs(n).toString().substring(0, 6) + (Number(t) << 1)
}

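It takes a random number as id, encodes it with a specific algorithm, and puts the result in the status field. The server then checks that the encoded status value matches the id value.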
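As a sanity check on that description, the id/status pair from the URL in the question can be reproduced offline. This is a minimal sketch, not the site's own code: `to_int32` and `make_status` are hypothetical helper names, and the 32-bit wrap is done with plain bit masks to imitate what JS bitwise operators do.

```python
def to_int32(value: int) -> int:
    """Wrap a Python int to a signed 32-bit integer, as JS bitwise operators do."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def make_status(id_value: int, data=None) -> str:
    """Hypothetical helper mirroring the minified create() function."""
    n = 0
    for char in str(id_value):
        # (n << 5) - n equals n * 31; `n &= n` in JS truncates to 32 bits each round.
        n = to_int32(to_int32(n << 5) - n + ord(char))
    # The suffix is the id shifted left by one, or the extra data if it is given.
    suffix = to_int32((id_value if data is None else data) << 1)
    return f"{str(abs(n))[:6]}{suffix}"


# The pair from the URL in the question: id=2432593, status=1332094865186
print(make_status(2432593))                     # 1332094865186
print(make_status(2432593) == "1332094865186")  # True
```

For id 2432593 the hash is -1332098482, so the first six digits of its absolute value are 133209, and 2432593 << 1 is 4865186; concatenated, they give the status seen in the URL.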

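It seems that a previously used id value still works, as in the example above (when no data is sent), but you can also reproduce the JS function above like this (example in Python):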

from random import randint
import ctypes
import requests

number = randint(1000000, 9999999)

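def encode(rand, data):
    """Port of the site's JS create(): hash the id, then append (value << 1)."""
    n = 0
    for char in str(rand):
        # JS: n = (n << 5) - n + charCode, then `n &= n` truncates to int32
        n = ctypes.c_int(ctypes.c_int(n << 5).value - n + ord(char)).value
    # The shifted value is the id itself when no data is passed, else the data
    value = rand if data is None else data
    suffix = ctypes.c_int(value << 1).value
    # First six digits of |hash| followed by the shifted value
    return f"{str(abs(n))[:6]}{suffix}"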

r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position",
                 params={
                     "id": number,
                     "status": encode(number, None)
                 },
                 headers={
                     "Origin": "https://www.katastar.hr"
                 })
print(r.json())

# GET parcel Id 13241901
parcelId = 13241901
r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/parcelInfo",
                 params={
                     "id": number,
                     "status": encode(number, parcelId)
                 },
                 headers={
                     "Origin": "https://www.katastar.hr"
                 })
print(r.json())

repl.it:https://replit.com/@bertrandmartel/ScrapeKatastarDecode