Question

For example, reading an RSS feed like this doesn't work, because the silly {http://purl.org ...} namespace gets inserted before 'item':

#!/usr/bin/env python3
import xml.etree.ElementTree as ET
import urllib, urllib.request

url = "http://some/rss/feed"
response = urllib.request.urlopen(url)
xml_text = response.read().decode('utf-8')
xml_root = ET.fromstring(xml_text)
for e in xml_root.findall('item'):
  print("I found an item!")

Since the {} prefix has rendered findall() useless, here is another solution, but it's ugly:

#!/usr/bin/env python3
import xml.etree.ElementTree as ET
import urllib, urllib.request

url = "http://some/rss/feed"
response = urllib.request.urlopen(url)
xml_text = response.read().decode('utf-8')
xml_root = ET.fromstring(xml_text)
for e in xml_root:
  if e.tag.endswith('}item'):
    print("I found an item!")

Can I make ElementTree just drop all the prefixes?

Answer 1

You need to handle the namespaces, as described here:

Parsing XML with namespace in Python via 'ElementTree'
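
As a minimal sketch of what namespace handling looks like with the standard library: the sample feed, the `f` prefix, and the `http://example.com/ns/feed` URI below are all placeholders I made up for illustration (the real URI in the question is truncated, so substitute whatever your feed declares). The first option passes a prefix map to findall(); the second uses the `{*}` wildcard, which as far as I know needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; the namespace URI is a placeholder,
# not the (truncated) one from the question.
SAMPLE = """<feed xmlns="http://example.com/ns/feed">
  <item><title>First</title></item>
  <item><title>Second</title></item>
</feed>"""

root = ET.fromstring(SAMPLE)

# Option 1: map a prefix of your own choosing to the namespace URI
# and use that prefix in the search path.
ns = {"f": "http://example.com/ns/feed"}
titles_with_map = [
    item.find("f:title", ns).text for item in root.findall("f:item", ns)
]

# Option 2 (Python 3.8+): "{*}" matches the tag in any namespace,
# so the URI does not have to be spelled out at all.
titles_with_wildcard = [
    item.find("{*}title").text for item in root.findall("{*}item")
]

print(titles_with_map)
print(titles_with_wildcard)
```

The wildcard form is the closest thing to "not worrying about prefixes" while leaving the tree untouched; the prefix map is more explicit and avoids accidentally matching a same-named element from a different namespace.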

But what if you used a library dedicated to reading RSS feeds instead, such as feedparser:

>>> import feedparser
>>> url = "http://some/rss/feed"
>>> feed = feedparser.parse(url)

Personally, though, I would use the XMLFeedSpider Scrapy spider. As a bonus, you get all the other features of the Scrapy web-scraping framework.
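
As for literally dropping the prefixes: I'm not aware of an ElementTree option that does this at parse time, but the tags can be rewritten after parsing. The following is only a sketch under that assumption; `strip_namespaces` is a hypothetical helper name, and the sample feed and its namespace URI are placeholders. Note that it only touches element tags, not namespaced attribute names.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Rewrite every element tag in place, dropping a leading '{uri}' part."""
    for elem in root.iter():
        # Comment and processing-instruction nodes have non-string tags; skip them.
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root


# Hypothetical sample feed; the namespace URI is a placeholder.
SAMPLE = """<feed xmlns="http://example.com/ns/feed">
  <item><title>First</title></item>
  <item><title>Second</title></item>
</feed>"""

root = strip_namespaces(ET.fromstring(SAMPLE))

# After stripping, the plain tag name from the question works again.
items = root.findall("item")
for item in items:
    print("I found an item!", item.find("title").text)
```

This discards information, so elements that differ only by namespace become indistinguishable; for feeds that mix several vocabularies, the namespace-aware lookups are the safer choice.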

How to walk an XML tree without worrying about namespace prefixes in Python?
