Question

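I have this nested sitemap. The Scrapy docs say SitemapSpider should work with nested sitemaps without any problems. My target links look like https://flatinfo.ru/arenda_kvartir.asp?id=867039. As far as I understand sitemap_rules, specifying a keyword from the link ('/arenda_kvartir/') should make the crawler behave as follows: every link found in sitemap.xml that contains a keyword from sitemap_rules is passed to the parsed function. According to the log, though, this never happens. The spider just walks through all the main categories in the sitemap and then exits. Where am I going wrong? My code is below.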

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import SitemapSpider

class CodeSpider(SitemapSpider):
    name = 'code_s'
    sitemap_urls = ['https://flatinfo.ru/sitemap.xml']
    sitemap_rules = [
        ('/arenda_kvartir/', 'parsed'),
        ('/sitemap_prodaja_kvartir/', 'parsed'),
    ]

    def parsed(self, response):
        yield {}

Answer 1

You need to match two kinds of URLs in sitemap_rules: https://flatinfo.ru/arenda.asp?house=43182 and https://flatinfo.ru/prodaja_kvartir.asp?id=17488515. The correct sitemap_rules are:

sitemap_rules = [
    ('arenda.asp', 'parsed'),
    ('prodaja_kvartir.asp', 'parsed'),
]
sitemap_follow = ['sitemap_prodaja_kvartir', 'sitemap_arenda']

(I added sitemap_follow to skip the other sitemap entries.)
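To see why the patterns from the question never fire, here is a minimal sketch that uses only the standard re module. It assumes SitemapSpider treats each sitemap_rules pattern as a regular expression and searches for it anywhere in the URL. The matching helper and the constant names are made up for illustration and are not part of Scrapy.

```python
import re

# Patterns taken from the question and from the answer.
QUESTION_RULES = ['/arenda_kvartir/', '/sitemap_prodaja_kvartir/']
ANSWER_RULES = ['arenda.asp', 'prodaja_kvartir.asp']

# Target URL quoted in the question.
QUESTION_URL = 'https://flatinfo.ru/arenda_kvartir.asp?id=867039'

# Page URLs quoted in the answer.
ANSWER_URLS = [
    'https://flatinfo.ru/arenda.asp?house=43182',
    'https://flatinfo.ru/prodaja_kvartir.asp?id=17488515',
]


def matching(patterns, urls):
    """Return the URLs that at least one pattern matches via re.search.

    Hypothetical helper: it only imitates the assumed matching logic.
    """
    compiled = [re.compile(p) for p in patterns]
    return [u for u in urls if any(c.search(u) for c in compiled)]


if __name__ == '__main__':
    # The trailing '/' in '/arenda_kvartir/' is not present in the URL.
    print(matching(QUESTION_RULES, [QUESTION_URL]))
    # Neither of the question's patterns matches the answer's page URLs.
    print(matching(QUESTION_RULES, ANSWER_URLS))
    # The answer's patterns match both page URLs.
    print(matching(ANSWER_RULES, ANSWER_URLS))
```

Under that assumption, '/arenda_kvartir/' cannot match because the real URLs continue with '.asp' rather than a slash, so no request is ever routed to parsed.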

One more thing: you need to wait (almost 20 minutes until the first URL is processed!), because each XML file takes a long time to process.

Scrapy Sitemap spider does not give the expected results
