Question

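I've written a Python script to scrape only the links to individual restaurants from a website, traversing multiple result pages. The text in the top right corner of the page shows how many links there are:

Showing 1-30 of 18891

But I can't get past this link, either manually or with the script. The site advances its results by 30 on each page.

So far I have tried: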

import requests
from bs4 import BeautifulSoup

link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start={}'

for page in range(960,1920,30): # modified the range to reproduce the issue

    resp = requests.get(link.format(page),headers={"User-Agent":"Mozilla/5.0"})

    print(resp.status_code,resp.url)

    soup = BeautifulSoup(resp.text, "lxml")
    for items in soup.select("li[class^='lemon--li__']"):

        if not items.select_one("h3 > a[href^='/biz/']"):continue
        lead_link = items.select_one("h3 > a[href^='/biz/']").get("href")
        print(lead_link)

The script above only gets the links from the landing page.

How can I get the links from the other pages as well?

Answer 1

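There is no data beyond that page, so requests with larger start offsets return nothing to parse.

Your code should be changed to request only the offsets below 960: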

import requests
from bs4 import BeautifulSoup

link = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start={}"

for page in range(0, 960, 30):  # offsets 0 through 930; there is no data beyond that

    resp = requests.get(link.format(page), headers={"User-Agent": "Mozilla/5.0"})

    print(resp.status_code, resp.url)

    soup = BeautifulSoup(resp.text, "lxml")
    for items in soup.select("li[class^='lemon--li__']"):
        # Look up the business link once; skip list items that have none
        anchor = items.select_one("h3 > a[href^='/biz/']")
        if anchor is None:
            continue
        print(anchor.get("href"))
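
If you would rather not hard-code the cutoff, one option is to stop as soon as a page comes back with no links. The following is only a minimal sketch of that idea: iter_links and fake_fetch are hypothetical names, and fake_fetch is a stand-in for the request-and-parse step above that returns made-up data so the loop can be run offline.

```python
from typing import Callable, Iterator, List


def iter_links(fetch_page: Callable[[int], List[str]],
               step: int = 30,
               max_start: int = 960) -> Iterator[str]:
    """Yield links page by page, stopping at the first empty page."""
    for start in range(0, max_start, step):
        links = fetch_page(start)
        if not links:
            # An empty page means the site has stopped serving results.
            break
        yield from links


def fake_fetch(start: int) -> List[str]:
    """Hypothetical stand-in for requests + BeautifulSoup.

    Pretends that only offsets below 90 return any results.
    """
    if start >= 90:
        return []
    return [f"/biz/example-{start + i}" for i in range(30)]


if __name__ == "__main__":
    found = list(iter_links(fake_fetch))
    print(len(found), found[0], found[-1])
```

In real use, fetch_page would wrap the requests.get call and the select_one lookup from the code above and return the hrefs found on that page.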

Answer 2

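Yelp is deliberately stopping you from doing this. They are trying to prevent exactly what you are attempting, because I expect a lot of people try to write crawlers for their site.

https://www.yelp.com/robots.txt even has a whimsical introduction and specifically mentions crawling, so you should get in touch with them.

So if you really need the data, contact them, or try something else that might slip through the cracks, such as filtering by suburb as suggested in the comments.

Either way, the simple answer is that Yelp does not allow what you are trying to do, so it is not possible this way.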
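
To see what a robots.txt file permits before crawling, Python's standard urllib.robotparser module can evaluate its rules. The sketch below parses a made-up rule set, not Yelp's actual file; SAMPLE_ROBOTS, is_allowed and the example.com URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; this is NOT Yelp's real robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "https://www.example.com/search?start=30"))
    print(is_allowed(SAMPLE_ROBOTS, "https://www.example.com/biz/some-place"))
```

Against a live site you would call set_url() with the robots.txt address and then read() instead of parse(). Note that robots.txt is only one signal; a site's terms of service may restrict scraping further.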
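
As a rough illustration of the filtering idea, the search could be split into several narrower locations, each paged separately. This sketch only builds the URLs, reusing the find_desc, find_loc and start parameters from the question; search_urls is a hypothetical helper, the neighborhood name is an example, and whether Yelp accepts such locations or tolerates this kind of crawling is not something the sketch verifies.

```python
from typing import Iterable, Iterator
from urllib.parse import quote, urlencode

BASE = "https://www.yelp.com/search"


def search_urls(locations: Iterable[str],
                max_start: int = 960,
                step: int = 30) -> Iterator[str]:
    """Yield one search URL per (location, start offset) pair."""
    for loc in locations:
        for start in range(0, max_start, step):
            query = urlencode(
                {"find_desc": "Restaurants", "find_loc": loc, "start": start},
                quote_via=quote,  # encode spaces as %20, as in the question's URL
            )
            yield f"{BASE}?{query}"


if __name__ == "__main__":
    # "Astoria, Queens, NY" is an assumed example of a narrower location.
    urls = list(search_urls(["New York, NY", "Astoria, Queens, NY"]))
    print(len(urls))
    print(urls[0])
```

Each generated URL can then be passed to the same request-and-parse loop, with duplicate /biz/ links removed afterwards since neighboring areas are likely to overlap.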

Unable to make my script fetch the links from the next pages of a stubborn website
