Question

I want to scrape the Security section of TheRegister.com and parse the XML into a data structure.

In the Scrapy shell, I tried:

>>> fetch('https://www.theregister.com/security/headlines.atom')

Response:

2020-11-07 09:34:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.theregister.com/security/headlines.atom> (referer: None)

The body of the response can be viewed; see the snippet below (I only included the first few lines):

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>tag:theregister.com,2005:feed/theregister.com/security/</id>
  <title>The Register - Security</title>
  <link rel="self" type="application/atom+xml" href="https://www.theregister.com/security/headlines.atom"/>
  <link rel="alternate" type="text/html" href="https://www.theregister.com/security/"/>
  <rights>Copyright © 2020, Situation Publishing</rights>
  <author>
    <name>Team Register</name>
    <email>webmaster@theregister.co.uk</email>
    <uri>https://www.theregister.com/odds/about/contact/</uri>
  </author>
  <icon>https://www.theregister.com/Design/graphics/icons/favicon.png</icon>
  <subtitle>Biting the hand that feeds IT — Enterprise Technology News and Analysis</subtitle>
  <logo>https://www.theregister.com/Design/graphics/Reg_default/The_Register_r.png</logo>
  <updated>2020-11-06T23:58:13Z</updated>
  <entry>
    <id>tag:theregister.com,2005:story211912</id>
    <updated>2020-11-06T23:58:13Z</updated>
    <author>
      <name>Thomas Claburn</name>
      <uri>https://search.theregister.com/?author=Thomas%20Claburn</uri>
    </author>
    <link rel="alternate" type="text/html" href="https://go.theregister.com/feed/www.theregister.com/2020/11/06/android_encryption_certs/"/>
    <title type="html">Let's Encrypt warns about a third of Android devices will from next year stumble over sites that use its certs</title>
    <summary type="html" xml:base="https://www.theregister.com/">&lt;h4&gt;Expiration of cross-signed root certificates spells trouble for pre-7.1.1 kit... unless they're using Firefox&lt;/h4&gt; &lt;p&gt;Let's Encrypt, a Certificate Authority (CA) that puts the "S" in "HTTPS" for about &lt;a target="_blank" rel="nofollow" href="https://letsencrypt.org/stats/"&gt;220m domains&lt;/a&gt;, has issued a warning to users of older Android devices that their web surfing may get choppy next year.…&lt;/p&gt; &lt;p&gt;&lt;!--#include virtual='/data_centre/_whitepaper_textlinks_top.html' --&gt;&lt;/p&gt;</summary>
  </entry>

Why can't I parse any data with the usual XPath expressions? I tried:

>>> response.xpath('entry')
[]
>>> response.xpath('/entry')
[]
>>> response.xpath('//entry')
[]
>>> response.xpath('.//entry')
[]
>>> response.xpath('entry/text()')
[]
>>> response.xpath('/entry/text()')
[]
>>> response.xpath('//entry/text()')
[]
>>> response.xpath('.//entry/text()')
[]

None of these worked. There are other XML tags, such as title, link, and author, that I cannot extract either.

Answer 1

TL;DR: run response.selector.remove_namespaces() before calling response.xpath().
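To illustrate why the unprefixed queries come back empty, here is a minimal sketch that uses only Python's standard library rather than Scrapy. The `ATOM_SAMPLE` string is a shortened, hypothetical stand-in for the feed, and `strip_namespaces` is an illustrative helper that only approximates the effect of `remove_namespaces()`; it is not Scrapy's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the Atom feed in the question.
ATOM_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>The Register - Security</title>
  <entry>
    <id>tag:theregister.com,2005:story211912</id>
    <title type="html">Example headline</title>
  </entry>
</feed>"""


def strip_namespaces(root):
    """Illustrative helper: drop the '{uri}' prefix from every tag, in place."""
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


root = ET.fromstring(ATOM_SAMPLE.encode("utf-8"))

# Every tag is really '{http://www.w3.org/2005/Atom}entry', so a bare
# 'entry' lookup matches nothing.
before = root.findall(".//entry")

# After stripping the namespace, the same bare lookup finds the element.
strip_namespaces(root)
after = root.findall(".//entry")

print(len(before), len(after))
```

The same idea applies in the Scrapy shell: once the namespace is gone, a query such as `response.xpath('//entry')` has plain element names to match against.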

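In essence, this strips xmlns="http://www.w3.org/2005/Atom" from the response so that you can write simpler XPath expressions. The unprefixed expressions return nothing because every element in the feed belongs to that default namespace, and a bare name such as entry only matches elements that are in no namespace. Alternatively, you can register the namespace and change the selector to include it: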

response.selector.register_namespace('n', 'http://www.w3.org/2005/Atom')
response.xpath('//n:entry')

You can read more details here.
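As for the original goal of turning the feed into a data structure, the following is a rough sketch of the namespace-aware approach using the standard library's `xml.etree.ElementTree` instead of Scrapy selectors. The `atom` prefix, the `parse_entries` function, and the dictionary keys are assumptions made for illustration; `ATOM_SAMPLE` is a trimmed copy of the entry shown in the question.

```python
import xml.etree.ElementTree as ET

# The prefix 'atom' is an arbitrary choice; only the URI has to match the feed.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Trimmed copy of the feed from the question (one entry, summary omitted).
ATOM_SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>The Register - Security</title>
  <entry>
    <id>tag:theregister.com,2005:story211912</id>
    <updated>2020-11-06T23:58:13Z</updated>
    <author>
      <name>Thomas Claburn</name>
    </author>
    <link rel="alternate" type="text/html" href="https://go.theregister.com/feed/www.theregister.com/2020/11/06/android_encryption_certs/"/>
    <title type="html">Let's Encrypt warns about a third of Android devices</title>
  </entry>
</feed>"""


def parse_entries(xml_bytes):
    """Return one dict per <entry>, looking up every tag via the Atom namespace."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
            "author": entry.findtext("atom:author/atom:name", default="", namespaces=ATOM_NS),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            # Compare with None explicitly: an Element without children is falsy.
            "link": link.get("href") if link is not None else "",
        })
    return entries


entries = parse_entries(ATOM_SAMPLE)
for item in entries:
    print(item["updated"], item["author"], item["title"])
```

Inside a Scrapy callback, `response.body` holds the raw bytes of the feed and could be passed to a function like this; staying with `response.xpath('//n:entry')` after registering the namespace is equally valid.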
