Question

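Suppose I have a main page index.html and four subpages 1.html … 4.html. All subpages are linked from the main page in the same way.

How can I follow these specific links with Python's Scrapy and scrape content according to a repeating pattern?

Here is the setup:

index.html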

<body>
<div class="one"><p>Text</p><a href="1.html">Link 1</a></div>
…
<div class="one"><p>Text</p><a href="4.html">Link 4</a></div>
</body>

1.html ... 4.html

<body>
<div class="one"><p>Text to be scraped</p></div>
</body>

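How do I set up a spider in Scrapy so that it follows only the links extracted from index.html?

I feel the example from the tutorial doesn't help me much here:

from scrapy.spider import Spider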

class IndexSpider(Spider):
    name = "index"
    allowed_domains = ["???"]
    start_urls = [
        "index.html"
    ]

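Note: this is a simplified example. In the original case, all URLs are on the web, and index.html contains many more links than just 1…4.html.

The question is how to follow exact links. They could be provided as a list, but ultimately they will come from an XPath selector: select the last column of a table, but only every other row.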

Answer 1

Use CrawlSpider and specify a rule for SgmlLinkExtractor:

# Import paths as used by the older Scrapy release this answer targets.
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = "mydomain"
    allowed_domains = ["www.mydomain"]
    start_urls = ["http://www.mydomain/index.html"]

    # Follow only links whose URL ends in <digits>.html (raw string, escaped dot).
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'\d+\.html$',)), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        # get the data
