Question

I have an XML page with the following structure:

<item>
  <pubDate>Sat, 12 Dec 2015 16:35:00 GMT</pubDate>
  <title>
   some text
  </title>
  <link>
     http://www.example.com/index.xml
  </link>
  ...

I want to extract and follow the links inside the <link> tags.

I only have the default code:

start_urls = ['http://example.com/example.xml']

rules = (
    Rule(LinkExtractor(allow="example.com"),
          callback='parse_item',),
)

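But I don't know how to follow a link that sits in a tag's text. I tried the LinkExtractor tags='links' option, but to no avail. According to the log, the spider does reach the page and gets a 200 response, but it extracts no links.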

Answer 1

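The key problem here is that this is not regular HTML input but an XML feed, and the links are in the element text, not in attributes. I think you just need XMLFeedSpider: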

import scrapy
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):
    name = 'myspider'
    start_urls = ['url_here']

    itertag = "item"

    def parse_node(self, response, node):
        for link in node.xpath(".//link/text()").extract():
            yield scrapy.Request(link.strip(), callback=self.parse_link)

    def parse_link(self, response):
        print(response.url)
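
To illustrate the distinction this answer draws, here is a minimal standard-library sketch (no Scrapy involved). The sample strings and variable names are made up for illustration. It shows that the <link> element in the feed has no attributes and carries the URL as its text, whereas an HTML anchor carries it in href, which is where LinkExtractor looks by default (tags=('a', 'area'), attrs=('href',)).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the feed structure from the question.
FEED_ITEM = """
<item>
  <title>some text</title>
  <link>
     http://www.example.com/index.xml
  </link>
</item>
"""

# For contrast: an HTML-style anchor, where the URL is an attribute value.
HTML_ANCHOR = '<a href="http://www.example.com/index.xml">some text</a>'

link = ET.fromstring(FEED_ITEM).find("link")
anchor = ET.fromstring(HTML_ANCHOR)

feed_attrs = dict(link.attrib)   # <link> in the feed has no attributes
feed_url = link.text.strip()     # the URL is the element's text
anchor_url = anchor.get("href")  # the URL is an attribute value

print(feed_attrs)
print(feed_url)
print(anchor_url)
```

An extractor that only inspects attributes has nothing to find in the feed, which is why reading the element text (as parse_node does above) is the approach suggested here.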

Answer 2

You should use the xml.etree library.

import xml.etree.ElementTree as ET



res = '''
<item>
  <pubDate>Sat, 12 Dec 2015 16:35:00 GMT</pubDate>
  <title>
   some text
  </title>
  <link>
     http://www.example.com/index.xml
  </link>
</item>
'''

root = ET.fromstring(res)
results = root.findall('.//link')
for link in results:
    # .text includes the whitespace around the URL, so strip it
    print(link.text.strip())

The output is:

http://www.example.com/index.xml
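
Since the text of a <link> element can include surrounding whitespace (as in the sample) and a feed usually holds several items, a slightly more defensive version may be useful. This is only a sketch: extract_links and SAMPLE_FEED are hypothetical names, and the <channel> wrapper is an assumed structure, not something shown in the question.

```python
import xml.etree.ElementTree as ET


def extract_links(xml_text):
    """Return the stripped text of every non-empty <link> element."""
    root = ET.fromstring(xml_text)
    links = []
    # iter() walks the whole tree, including the root element itself
    for node in root.iter("link"):
        # node.text is None for an empty element such as <link/>
        text = (node.text or "").strip()
        if text:
            links.append(text)
    return links


# Hypothetical feed with several items, one of which has an empty <link/>.
SAMPLE_FEED = """
<channel>
  <item>
    <title>first</title>
    <link>
       http://www.example.com/a.xml
    </link>
  </item>
  <item>
    <title>second</title>
    <link>http://www.example.com/b.xml</link>
  </item>
  <item>
    <title>no link</title>
    <link/>
  </item>
</channel>
"""

feed_links = extract_links(SAMPLE_FEED)
print(feed_links)
```

The returned list could then be fed to whatever does the fetching, for example by yielding one request per URL from a spider callback.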

Extracting links from XML using Scrapy
