Extracting data from an lxml tree

Date: 2017-06-02 11:36:43

Tags: python xpath lxml

Preamble

I followed this guide

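Unfortunately it does not quite work, so I cannot extract the data I want from the lxml tree. I am not especially interested in this particular case; I am looking for a more general answer.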

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit() 

url = 'http://pycoders.com/archive/'
# This does the magic: it loads everything.
r = Render(url)
# result is a QString.
result = r.frame.toHtml()
# The QString should be converted to a string before lxml processes it.
formatted_result = str(result.toAscii())

# Next, build the lxml tree from formatted_result.
tree = html.fromstring(formatted_result)

The guide continues:

archive_links = tree.xpath('//divass="campaign"]/a/@href')

which results in an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src\lxml\lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src\lxml\lxml.etree.c:59353)
  File "src\lxml\xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:171227)
  File "src\lxml\xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:170184)
lxml.etree.XPathEvalError: Invalid expression

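Question

To get at my data I still need to use the right XPath. As a test I tried title = tree.xpath('//title'). That leaves me with an <element title at 0xdf418> object. I cannot extract the data from this object, i.e. the title in this case.

I tried a few things, but none of them actually returned the data.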

>>> title.__len__()
1
>>> title.__sizeof__()
72
>>> type(title)
<type 'list'>
>>> title[0]
<element title at 0xdfc418>

1 answer:

Answer 0 (score: 1)

There may be a typo. Try this:

archive_links = tree.xpath('//div[class="campaign"]/a/@href')

Or:

archive_links = tree.xpath('//div[@class="campaign"]/a/@href')
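
On the more general question of getting data out of the result, here is a minimal sketch. It assumes a small hypothetical SAMPLE page in place of the rendered one, so the real markup may well differ. It shows that xpath() returns a list, that an element's text is available through .text or text_content(), and that ending the expression in text() or @attr returns strings directly. It also suggests why the two predicates above are not equivalent: [@class="campaign"] tests the attribute, while [class="campaign"] tests a child element named class and so would most likely match nothing on a normal HTML page.

from lxml import etree, html

# Hypothetical stand-in for the rendered page; the real markup may differ.
SAMPLE = """
<html>
  <head><title>Pycoders Weekly Archive</title></head>
  <body>
    <div class="campaign"><a href="/issue/1">Issue 1</a></div>
    <div class="campaign"><a href="/issue/2">Issue 2</a></div>
  </body>
</html>
"""

tree = html.fromstring(SAMPLE)

# xpath() returns a list; element results are objects, not strings.
title_elements = tree.xpath('//title')
title_text = title_elements[0].text            # text directly inside <title>
title_full = title_elements[0].text_content()  # text of the element and its descendants

# Selecting text nodes or attributes in the expression yields strings directly.
title_strings = tree.xpath('//title/text()')

# [@class="campaign"] tests the class attribute of each div.
archive_links = tree.xpath('//div[@class="campaign"]/a/@href')

# [class="campaign"] tests a child element named <class>, so nothing matches here.
no_links = tree.xpath('//div[class="campaign"]/a/@href')

# The malformed expression from the guide is rejected by the XPath parser.
try:
    tree.xpath('//divass="campaign"]/a/@href')
    bad_expression_error = None
except etree.XPathError as exc:  # base class of XPathEvalError
    bad_expression_error = exc

print(title_text)
print(archive_links)

If only the first match matters, indexing the list (title_elements[0]) is enough, but it is worth checking that the list is not empty first, since an expression that matches nothing returns [] rather than raising.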