Question

I'm using Scrapy to crawl and scrape data from a website, mainly HTML pages and PDF files (I have modified IGNORED_EXTENSIONS to allow crawling PDFs).

I need to extract the text caught between the <a> tags:

<a href='some_document.pdf'>I need this text</a>

Obviously, I can't use response.text or response.css on the PDF response, because there are only bytes to read (you get an AttributeError).

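One thing I thought of was to crawl the page, extract all the links from it, and save them in a text file. It works, except that I end up with a lot of duplicate links, broken links (think 403, 404, 500), and many links I don't care about. I think there must be a better way!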

While reading the Scrapy docs, I stumbled on the documentation for LxmlLinkExtractor. Its constructor has 2 interesting fields:

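tags (str or list) - a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
attrs (list) - an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)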

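This made me wonder whether it is possible to get the attribute values of the <a> element before crawling it. Am I right? If so, how can I scrape the text between the tags?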

Source code:

import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ArchiveSpider(CrawlSpider):

    # ...some code...

    rules = [
        Rule(LinkExtractor(allow=[re.compile('pdf', re.IGNORECASE)]),
             callback='parse_pdf',
             follow=True),
        Rule(LinkExtractor(), callback='parse_item', follow=True)
    ]

    def parse_pdf(self, response):
        yield dict(url=response.url)

    def parse_item(self, response):
        # PDFs can reach this callback too; hand them over to parse_pdf
        content_type = (response.headers.get('Content-Type') or b'').decode('utf-8')
        if re.search('pdf', content_type, re.IGNORECASE):
            yield from self.parse_pdf(response)
            return
        title = response.css('title::text').extract()[0].strip() if response.css('title::text') else ''
        yield dict(title=title, url=response.url)

Answer 1

I'm not sure I understand you correctly, but can't you do something like this:

text = response.xpath('//a/text()').extract()

You have to specify the desired <a> element in the XPath; text() selects the text between the tags.

Extracting information from anchor elements that lead to PDF files
