Question

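Hiya,

I'm having trouble parsing an RSS feed from stackexchange in Python. When I try to get the summary nodes, an empty list is returned.

I've been trying to solve this, but can't get my head around it.

Can anyone help out? Thanks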

In [30]: import lxml.etree, urllib2



In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds' 

In [32]: cooking_content = urllib2.urlopen(url_cooking)

In [33]: cooking_parsed = lxml.etree.parse(cooking_content)

In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')

In [35]: cooking_texts
Out[35]: []


Answer 1

Take a look at these two versions:

import lxml.html, lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

#lxml.etree version
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

#lxml.html version
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

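As you discovered, the etree version returns no nodes, but the lxml.html version works fine. The etree version fails because it expects namespaces, while the html version works because it ignores them. Part way down http://lxml.de/lxmlhtml.html, it says "The HTML parser notably ignores namespaces and some other XMLisms."

Note that when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it is a feed element in the namespace http://www.w3.org/2005/Atom. Here is a corrected version of the etree code.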

import lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

ns = 'http://www.w3.org/2005/Atom'
ns_map = {'ns': ns}

data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
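
To see the namespace effect without hitting the network, here is a minimal sketch. It rests on several assumptions: the two-entry feed is made up for illustration, the helper name find_summaries is hypothetical, and the standard library's xml.etree.ElementTree is swapped in for lxml so the snippet runs without extra packages. The un-prefixed path should match nothing, while the prefixed path should find both summaries, mirroring the behaviour described above.

```python
import xml.etree.ElementTree as ET

ATOM_NS = 'http://www.w3.org/2005/Atom'

# Hypothetical two-entry Atom document standing in for the real feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry>
    <title>First question</title>
    <summary>How long should I boil an egg?</summary>
  </entry>
  <entry>
    <title>Second question</title>
    <summary>Can I freeze fresh basil?</summary>
  </entry>
</feed>
"""


def find_summaries(xml_text):
    """Return (plain_matches, namespaced_matches) for the given Atom text."""
    root = ET.fromstring(xml_text)  # root is the namespaced <feed> element
    # No prefix: the path asks for elements in no namespace, so nothing matches.
    plain = root.findall('.//entry/summary')
    # The prefix 'a' is mapped to the Atom namespace, so the summaries match.
    namespaced = root.findall('.//a:entry/a:summary', {'a': ATOM_NS})
    return plain, namespaced


plain, namespaced = find_summaries(SAMPLE_FEED)
print('Without namespace: %d summary nodes' % len(plain))
print('With namespace: %d summary nodes' % len(namespaced))
```

The prefix itself ('a' here, 'ns' in the code above) is arbitrary; only the namespace URI it maps to has to match the one in the document.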

Answer 2

The problem is namespaces.

Run:

 cooking_parsed.getroot().tag

and you'll see that the element is named

{http://www.w3.org/2005/Atom}feed

The same is true if you navigate to one of the feed entries: its tag carries the same namespace.

This means the correct xpath in lxml is:

print(cooking_parsed.xpath(
  "//a:feed/a:entry",
  namespaces={'a': 'http://www.w3.org/2005/Atom'}))
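
The braces form shown by .tag can also be used directly when searching, instead of a prefix map. The sketch below is only an illustration under stated assumptions: the one-entry feed is invented, the helper qualified is a hypothetical name, and the standard library's xml.etree.ElementTree stands in for lxml so that it runs offline.

```python
import xml.etree.ElementTree as ET

ATOM_NS = 'http://www.w3.org/2005/Atom'

# Hypothetical one-entry Atom document used instead of the live feed.
SAMPLE_FEED = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><title>Example</title><summary>Example summary</summary></entry>'
    '</feed>'
)


def qualified(local_name, namespace=ATOM_NS):
    """Build a '{namespace}local_name' tag, the same form that .tag reports."""
    return '{%s}%s' % (namespace, local_name)


root = ET.fromstring(SAMPLE_FEED)
print(root.tag)  # the root tag carries the namespace in braces

# Collect every descendant whose fully qualified tag is Atom's <entry>.
entries = list(root.iter(qualified('entry')))
for entry in entries:
    # Child lookups need the qualified name as well.
    print(entry.find(qualified('summary')).text)
```

This avoids choosing a prefix at all, at the cost of longer tag strings.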

Answer 3

Try using BeautifulStoneSoup from the beautifulsoup import. It might do the trick.

lxml - difficulty parsing stackexchange rss feed
