XPath on an XML RSS feed does not work as expected

Time: 2014-10-23 16:26:55

Tags: xml xpath scrapy

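I am trying to parse this RSS feed in the Scrapy (0.16) shell, but it does not work as expected and I cannot tell what is wrong. It seems only attributes such as @href are accessible: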

>>> fetch('http://www2c.cdc.gov/podcasts/feed.asp?feedid=183')
2014-10-23 12:20:54-0400 [default] DEBUG: Crawled (200) <GET http://www2c.cdc.go
v/podcasts/feed.asp?feedid=183> (referer: None)
[s] Available Scrapy objects:
[s]   item       {}
[s]   request    <GET http://www2c.cdc.gov/podcasts/feed.asp?feedid=183>
[s]   response   <200 http://www2c.cdc.gov/podcasts/feed.asp?feedid=183>
[s]   settings   <CrawlerSettings module=<module 'ebola.scraper.scrape.settings'
 from 'ebola\scraper\scrape\settings.pyc'>>
[s]   spider     <BaseSpider 'default' at 0x3efc130>
[s]   xxs        <XmlXPathSelector xpath=None data=u'<feed xmlns="http://www.w3.
org/2005/Atom'>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>> xxs.select("//entry").extract()
[]
>>> xxs.select("//link").extract()
[]
>>> xxs.select("//link/text()").extract()
[]
>>> xxs.select("//title").extract()
[]
>>> xxs.select("//title/text()").extract()
[]
>>> xxs.select("//link/@href").extract()
[]
>>> xxs.select("//@href").extract()
[u'http://www2c.cdc.gov/podcasts/feed.asp?feedid=183', u'http://www.cdc.gov/medi
a/index.html', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634459', u'h
ttp://www.cdc.gov/media/releases/2014/images/p1022-post-arrival-monitoring-300x2
00.jpg', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634458', u'http://
www2c.cdc.gov/podcasts/download.asp?af=h&f=8634453', u'http://www2c.cdc.gov/podc
asts/download.asp?af=h&f=8634436', u'http://www2c.cdc.gov/podcasts/download.asp?
af=h&f=8634435', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634434', u
'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634417', u'http://www2c.cdc.
gov/podcasts/download.asp?af=h&f=8634403', u'http://www2c.cdc.gov/podcasts/downl
oad.asp?af=h&f=8634373', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=863
4367', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634365', u'http://ww
w2c.cdc.gov/podcasts/download.asp?af=h&f=8634362', u'http://www2c.cdc.gov/podcas
ts/download.asp?af=h&f=8634361', u'http://www2c.cdc.gov/podcasts/download.asp?af
=h&f=8634355', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634350', u'h
ttp://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634349', u'http://www2c.cdc.go
v/podcasts/download.asp?af=h&f=8634330', u'http://www2c.cdc.gov/podcasts/downloa
d.asp?af=h&f=8634329', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=86343
28', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634325', u'http://www2
c.cdc.gov/podcasts/download.asp?af=h&f=8634324', u'http://www2c.cdc.gov/podcasts
/download.asp?af=h&f=8634322', u'http://www2c.cdc.gov/podcasts/download.asp?af=h
&f=8634283', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634278', u'htt
p://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634277', u'http://www2c.cdc.gov/
podcasts/download.asp?af=h&f=8634273', u'http://www2c.cdc.gov/podcasts/download.
asp?af=h&f=8634265', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634262
', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634250', u'http://www2c.
cdc.gov/podcasts/download.asp?af=h&f=8634251', u'http://www.cdc.gov/media/DPK/20
14/images/vs-crash-injuries/fb.jpg', u'http://www2c.cdc.gov/podcasts/download.as
p?af=h&f=8634248', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634234',
 u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634233', u'http://www2c.cd
c.gov/podcasts/download.asp?af=h&f=8634225', u'http://www2c.cdc.gov/podcasts/dow
nload.asp?af=h&f=8634224', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8
634222', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634221', u'http://
www2c.cdc.gov/podcasts/download.asp?af=h&f=8634323', u'http://www2c.cdc.gov/podc
asts/download.asp?af=h&f=8634217', u'http://www2c.cdc.gov/podcasts/download.asp?
af=h&f=8634214', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634178', u
'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634174', u'http://www.cdc.go
v/media/images/L2/p1002-smoke-free-housing.jpg', u'http://www2c.cdc.gov/podcasts
/download.asp?af=h&f=8634173', u'http://www2c.cdc.gov/podcasts/download.asp?af=h
&f=8634211', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634164', u'htt
p://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634157', u'http://www2c.cdc.gov/
podcasts/download.asp?af=h&f=8634160', u'http://www2c.cdc.gov/podcasts/download.
asp?af=h&f=8634161', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634146
', u'http://www2c.cdc.gov/podcasts/download.asp?af=h&f=8634073']
>>>

Please keep in mind that changing the Scrapy version is not an option; I am locked to 0.16. Any ideas are appreciated.

1 answer:

Answer 0 (score: 2)

If you view the page source in a browser, you will see that the source XML is in a default namespace:

<feed xmlns="http://www.w3.org/2005/Atom">

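All descendant elements of feed belong to this namespace as well, which is why your selectors return no results. The exception is the selector that targets attributes:

"It seems only attributes such as @href are accessible"

That one works because attributes do not take on the default namespace: an unprefixed attribute is in no namespace at all.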
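The behaviour described above can be reproduced outside Scrapy. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree rather than Scrapy's XmlXPathSelector; the SAMPLE document and its example.com URL are made up for illustration and only mimic the shape of the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature Atom document standing in for the real feed.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry>
    <title>First entry</title>
    <link href="http://example.com/1"/>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# Element names are stored together with their namespace URI.
print(root.tag)

# An unqualified element name is in no namespace, so it matches nothing here.
unqualified = root.findall(".//entry")
print(unqualified)

# Attribute names do not inherit the default namespace, so 'href' is found as-is.
hrefs = [el.get("href") for el in root.iter() if el.get("href") is not None]
print(hrefs)
```

The element lookup comes back empty while the attribute lookup succeeds, which matches what the shell session in the question shows.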


To access elements that are in a namespace, you must first register that namespace and choose a prefix for it:

xxs.register_namespace("atom", "http://www.w3.org/2005/Atom")

Then prefix the element names with atom: (or whichever prefix you registered):

xxs.select("//atom:link").extract()
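The same prefix-to-URI idea can be sketched with the standard library, which may help when checking expressions without a Scrapy shell at hand. This is an analogue, not the Scrapy 0.16 API: ElementTree takes the mapping as an argument instead of a register_namespace call, and the SAMPLE data and the ATOM name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature Atom document standing in for the real feed.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>First entry</title>
    <link href="http://example.com/1"/>
  </entry>
  <entry>
    <title>Second entry</title>
    <link href="http://example.com/2"/>
  </entry>
</feed>"""

# Prefix-to-URI mapping; the prefix is arbitrary, only the URI must match.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(SAMPLE)

entries = []
for entry in root.findall(".//atom:entry", ATOM):
    # Child elements are namespaced too, so they need the prefix as well.
    title = entry.findtext("atom:title", default="", namespaces=ATOM)
    link = entry.find("atom:link", ATOM)
    # The href attribute itself stays unprefixed.
    href = link.get("href") if link is not None else None
    entries.append((title, href))

print(entries)
```

Note that every element step in the path carries the prefix, while the attribute name does not.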

More information can be found in the relevant section of the Scrapy documentation.