Question

I'm trying to crawl a web page to get its reviews and ratings, but I get the same data as output for every page.

import scrapy
import json
from scrapy.spiders import Spider


class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        for i in range(1, 10):
            url = "https://www.fandango.com/aquaman-208499/movie-reviews?pn=" + str(i)
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(json.dumps({'rating': response.xpath("//div[@class='star-rating__score']").xpath("@style").extract(),
               'review': response.xpath("//p[@class='fan-reviews__item-content']/text()").getall()}))

Expected: crawl 1000 pages of the site https://www.fandango.com/aquaman-208499/movie-reviews

Actual output:

https://mobile.fandango.com/aquaman-208498/movie-reviews?pn=1
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}

https://mobile.fandango.com/aquaman-208499/movie-reviews?pn=9
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}

Answer 1

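The reviews are populated dynamically with JavaScript, so they are not in the HTML that Scrapy downloads for each page. In cases like this you have to inspect the requests the site itself makes.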

The URL that returns the user reviews is this one:

https://www.fandango.com/napi/fanReviews/208499/1/5

It returns JSON containing 5 reviews.
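As a rough sketch, the URL above can be generated for any page. The helper name `build_review_request` is hypothetical, and reading the path segments as movie id, page number, and page size is an assumption inferred from the URL and the "5 reviews" it returns. The site does not document this.

```python
def build_review_request(movie_id, page, page_size=5):
    """Return (url, headers) for one page of fan reviews.

    Assumption: the path is /napi/fanReviews/<movie id>/<page>/<page size>.
    """
    url = "https://www.fandango.com/napi/fanReviews/{}/{}/{}".format(
        movie_id, page, page_size
    )
    # The referer mirrors the public review page for the same page number.
    headers = {
        "referer": "https://www.fandango.com/aquaman-{}/movie-reviews?pn={}".format(
            movie_id, page
        )
    }
    return url, headers


url, headers = build_review_request("208499", 1)
print(url)  # https://www.fandango.com/napi/fanReviews/208499/1/5
```

Keeping the URL construction in one place makes it easier to change the page range later, for example to cover the 1000 pages mentioned in the question.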

Your spider could be rewritten like this:

import scrapy
import json
from scrapy.spiders import Spider


class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        movie_id = "208499"
        for page in range(1, 10):
            # You have to pass the referer, otherwise the site returns a 403 error
            headers = {'referer': 'https://www.fandango.com/aquaman-{movie_id}/movie-reviews?pn={page}'.format(movie_id=movie_id, page=page)}
            url = "https://www.fandango.com/napi/fanReviews/{movie_id}/{page}/5".format(movie_id=movie_id, page=page)
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        data = json.loads(response.text)
        for review in data['data']:
            yield review

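Notice that I'm also using yield instead of print to extract the items, which is how Scrapy expects items to be generated. You can run the spider like this to export the extracted items to a file: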

scrapy crawl rate -o outputfile.json
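The XPath in the question returns ratings as CSS width strings such as "width: 90%;". If you still need to work with ratings in that form, a small helper along these lines could convert them to a star score. The function name `style_to_stars` is hypothetical, and the mapping of 100% width to 5 stars is an assumption that is not confirmed anywhere in the question.

```python
import re


def style_to_stars(style, max_stars=5):
    """Convert a style string like 'width: 90%;' to a star score.

    Assumption: a width of 100% corresponds to max_stars.
    Returns None when no percentage width is found.
    """
    match = re.search(r"width:\s*(\d+(?:\.\d+)?)%", style)
    if match is None:
        return None
    return float(match.group(1)) * max_stars / 100


styles = ["width: 90%;", "width: 100%;", "width: 60%;"]
print([style_to_stars(s) for s in styles])  # [4.5, 5.0, 3.0]
```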
