I want to scrape the Security section of TheRegister.com and parse the XML feed into a data structure.
In the Scrapy shell, I tried:
>>> fetch('https://www.theregister.com/security/headlines.atom')
Response:
2020-11-07 09:34:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.theregister.com/security/headlines.atom> (referer: None)
The response body can be viewed; see the snippet below (I have only included the first few lines):
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<id>tag:theregister.com,2005:feed/theregister.com/security/</id>
<title>The Register - Security</title>
<link rel="self" type="application/atom+xml" href="https://www.theregister.com/security/headlines.atom"/>
<link rel="alternate" type="text/html" href="https://www.theregister.com/security/"/>
<rights>Copyright © 2020, Situation Publishing</rights>
<author>
<name>Team Register</name>
<email>webmaster@theregister.co.uk</email>
<uri>https://www.theregister.com/odds/about/contact/</uri>
</author>
<icon>https://www.theregister.com/Design/graphics/icons/favicon.png</icon>
<subtitle>Biting the hand that feeds IT — Enterprise Technology News and Analysis</subtitle>
<logo>https://www.theregister.com/Design/graphics/Reg_default/The_Register_r.png</logo>
<updated>2020-11-06T23:58:13Z</updated>
<entry>
<id>tag:theregister.com,2005:story211912</id>
<updated>2020-11-06T23:58:13Z</updated>
<author>
<name>Thomas Claburn</name>
<uri>https://search.theregister.com/?author=Thomas%20Claburn</uri>
</author>
<link rel="alternate" type="text/html" href="https://go.theregister.com/feed/www.theregister.com/2020/11/06/android_encryption_certs/"/>
<title type="html">Let's Encrypt warns about a third of Android devices will from next year stumble over sites that use its certs</title>
<summary type="html" xml:base="https://www.theregister.com/"><h4>Expiration of cross-signed root certificates spells trouble for pre-7.1.1 kit... unless they're using Firefox</h4> <p>Let's Encrypt, a Certificate Authority (CA) that puts the "S" in "HTTPS" for about <a target="_blank" rel="nofollow" href="https://letsencrypt.org/stats/">220m domains</a>, has issued a warning to users of older Android devices that their web surfing may get choppy next year.…</p> <p><!--#include virtual='/data_centre/_whitepaper_textlinks_top.html' --></p></summary>
</entry>
Why can't I parse any data with the usual XPath expressions? I tried:
>>> response.xpath('entry')
[]
>>> response.xpath('/entry')
[]
>>> response.xpath('//entry')
[]
>>> response.xpath('.//entry')
[]
>>> response.xpath('entry/text()')
[]
>>> response.xpath('/entry/text()')
[]
>>> response.xpath('//entry/text()')
[]
>>> response.xpath('.//entry/text()')
[]
None of these worked. There are other XML tags, such as title, link, and author, that I cannot extract either.
Answer 0 (score: 1)
TL;DR: run response.selector.remove_namespaces() before calling response.xpath().
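Essentially, this strips xmlns="http://www.w3.org/2005/Atom" from the response so that you can write simpler XPath expressions. The feed declares that URI as its default namespace, so every element (including entry) belongs to it, and an unprefixed name such as //entry matches nothing.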
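To see the effect outside Scrapy, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy's selector. The sample feed is a trimmed-down stand-in for the real response, and strip_namespaces is a hypothetical helper that only mimics the idea behind remove_namespaces(); it is not how Scrapy implements it.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the feed in the question (not the real response body).
ATOM = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>The Register - Security</title>
  <entry>
    <id>tag:theregister.com,2005:story211912</id>
    <title type="html">Example headline</title>
  </entry>
</feed>"""

root = ET.fromstring(ATOM)

# The default xmlns puts every element in the Atom namespace,
# so the bare name "entry" matches nothing.
before = root.findall(".//entry")


def strip_namespaces(element):
    """Hypothetical helper: drop the '{uri}' prefix from every tag, in place."""
    for node in element.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(node.tag, str) and node.tag.startswith("{"):
            node.tag = node.tag.split("}", 1)[1]
    return element


strip_namespaces(root)

# With the namespace gone, the plain path finds the entry.
after = root.findall(".//entry")

print(len(before), len(after))
```

The same reasoning applies in the Scrapy shell: once the namespace is removed, response.xpath('//entry') has plain element names to match against.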
Alternatively, you can register the namespace and change your selectors to include its prefix:
response.selector.register_namespace('n', 'http://www.w3.org/2005/Atom')
response.xpath('//n:entry')
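Since the goal is to end up with a data structure, here is a rough sketch of the namespace-aware approach that builds a list of dicts. It again uses the standard library's xml.etree.ElementTree rather than Scrapy, with a shortened sample entry; parse_entries, the NS mapping, the dict keys, and the "Example headline" title are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping; the prefix "n" is arbitrary, only the URI must match the feed.
NS = {"n": "http://www.w3.org/2005/Atom"}

# Shortened sample based on the entry shown in the question.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:theregister.com,2005:story211912</id>
    <updated>2020-11-06T23:58:13Z</updated>
    <author><name>Thomas Claburn</name></author>
    <link rel="alternate" type="text/html" href="https://go.theregister.com/feed/www.theregister.com/2020/11/06/android_encryption_certs/"/>
    <title type="html">Example headline</title>
  </entry>
</feed>"""


def parse_entries(xml_text):
    """Hypothetical helper: return one dict per <entry> in an Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("n:entry", NS):
        link = entry.find("n:link", NS)
        entries.append({
            "id": entry.findtext("n:id", namespaces=NS),
            "updated": entry.findtext("n:updated", namespaces=NS),
            "author": entry.findtext("n:author/n:name", namespaces=NS),
            "title": entry.findtext("n:title", namespaces=NS),
            # The URL lives in the href attribute, not in the element text.
            "link": link.get("href") if link is not None else None,
        })
    return entries


entries = parse_entries(SAMPLE)
print(entries)
```

In Scrapy the equivalent selectors would carry the registered prefix on every step, for example //n:entry/n:title/text().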
You can read more details here.