Scrapy XPath is not returning the desired results. Any ideas?

Time: 2015-06-11 15:09:52

Tags: html xpath scrapy

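Please take a look at this page: http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=15845 . As you may have guessed, I am trying to scrape all the fields on this page. Every field is extracted properly except the "Answer" field. What I find odd is that the page structure for the question and the answer is almost identical (table[1] and table[2]); the question is scraped correctly but the answer is not. Here are my XPaths: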

Question:

['q_main'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[1]/tbody/tr/td/text()').extract()

This works perfectly.

Answer:

['q_answer'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[2]/tbody/tr[2]/td/text()').extract()

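This returns an empty result. I copied the full XPath as returned by XPath Helper and verified it in the browser console. What am I overlooking? Is there something I fail to see?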

2 answers:

Answer 0: (score: 0)

It looks like there is a problem with your XPath.

Here is a demo from the scrapy shell:

In [1]: response.xpath('//tr[td[@class="mainheaderq" and contains(font/text(), "ANSWER")]]/following-sibling::tr/td[@class="griditemq"]//text()').extract()
Out[1]: 
[u'\r\n\r\n',
 u'MINISTER OF STATE(I/C) FOR COAL, POWER AND NEW & RENEWABLE ENERGY   (SHRI PIYUSH GOYAL)\r\n\r\n ',
 u'(a) & (b): So far 29 coal mines have been auctioned under the provisions of Coal Mines (Special Provisions) \r\nAct, 2015 and the Rules made thereunder. The auction process for non-regulated sector viz. Iron and Steel, \r\nCement and Captive Power was based on forward bidding process where bidders had to submit their final price \r\noffer above the applicable floor price. In case of Power sector which is a regulated one, reverse bidding \r\nmethodology was adopted where bidders had to submit bids below the applicable ceiling price, which shall be \r\ntaken as fuel cost in determination of power tariff. In case, bid price reaches Rs. zero in reverse bidding, \r\nthe bidding is based on additional premium payable to the concerned State Government, over and  above  the  \r\nfixed  reserve  price  of  Rs. 100/-  per  tonne.\r\n\r\n',
 u'\r\nRevenue which would accrue to the coal bearing State Government concerned comprises of Upfront payment \r\nas prescribed in the tender document, Auction proceeds and Royalty on per tonne of coal production. State-wise \r\ndetails of 29 coal mines auctioned so far along-with specified end-uses and estimated revenue which would accrue \r\nto coal bearing state during the life of mine/lease period as given below:\r\n',
 u'\r\n\r\nS.No\tState\t\tSpecified End \u2013Use\t\t\tName of Coal Mine\t\tEstimated Revenueduring \r\n\t\t\t\t\t\t\t\t\t\t\t\tthe life of mine/lease \r\n\t\t\t\t\t\t\t\t\t\t\t\tperiod (Rs. In Crores)\r\n1\tChattishgarh\tNon-Regualted Sector\t\t\tChotia\t\t\t\t51596\r\n\t\t\t\t\t\t\t\tGare Palma IV-4\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-5\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-7\t\r\n\t\t\t\t\t\t\t\tGare-Palma Sector-IV/8\r\n2\tJharkhand\tNon-Regualted Sector\t\t\tBrinda and Sasai\t\t49272\r\n\t\t\t\t\t\t\t\tDumri\r\n\t\t\t\t\t\t\t\tKathautia\r\n\t\t\t\t\t\t\t\tLohari\r\n\t\t\t\t\t\t\t\tMeral\r\n\t\t\t\t\t\t\t\tMoitra\r\n\t\t\tPower\t\t\t\t\tGaneshpur\r\n\t\t\t\t\t\t\t\tJitpur\r\n\t\t\t\t\t\t\t\tTokisud North\r\n3\tMadhya Pradesh\tNon-Regualted Sector\t\t\tBicharpur\t\t\t42811\r\n\t\t\t\t\t\t\t\tMandla North\r\n\t\t\t\t\t\t\t\tMandla-South\r\n\t\t\t\t\t\t\t\tSialGhoghri\r\n\t\t\tPower\t\t\t\t\tAmelia North\r\n4\tMaharashtra\tNon-Regualted Sector\t\t\tBelgaon\t\t\t\t2738\r\n\t\t\t\t\t\t\t\tMarkiMangli III\r\n\t\t\t\t\t\t\t\tNerad Malegaon\r\n5\tOdisha\t\tPower\t\t\t\t\tMandakini\t\t\t33741\r\n\t\t\t\t\t\t\t\tTalabira-I\r\n\t\t\t\t\t\t\t\tUtkal - C\r\n6\tWest Bengal\tNon-Regualted Sector\t\t\tArdhagram\t\t\t13354\r\n\t\t\tPower\t\t\t\t\tSarisatolli\r\n\t\t\t\t\t\t\t\tTrans Damodar\r\n\tTotal\t\t\t\t\t\t\t(29) coal blocks\t\t193512\r\n',
 u'\r\n\r\n\r\nCoal mine has been assigned to successful bidder as Designated Custodian in view of a court case.\r\n\r\n',
 u'\r\nIn addition, an estimated amount of Rs. 1,41,854 Crores would accrue to coal bearing States from allotment \r\nof 38 coal mines to Central and State PSU\u2019s.\r\n\r\n',
 u'Out of these 29 coal mines, 16 are operational coal mines included in Schedule-II of the Act and 13 are \r\nnon-operational included in Schedule-III of the Act. Milestones for development and production of coal \r\nfrom the auctioned coal mines have been prescribed under the Coal Mines Development and Production Agreement \r\nsigned with the Successful Bidder. \r\n\r\n ',
 u'(c) & (d): Yes, Sir. A few complaints were received regarding cartelization in bidding. It is not possible to \r\nconclusively establish the same until investigation are carried out by Competent Authority. ',
 u'\r\n\r\n\r\nThe Government has not approved the recommendation of NA for declaration of successful bidder in case of \r\n4 coal mines namely Gare Palma IV/2&3, Gare Palma IV/1 and Tara as final closing bid price was not found \r\nto be reflecting fair value.  ',
 u'\r\n\r\n\r\n']

This sometimes happens when you are dealing with tables; for more details, see this.
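The fragments in the output above carry a lot of `\r\n` and padding. As a minimal sketch, assuming you want one clean string per text node, the list returned by `.extract()` could be post-processed as follows. The helper name `clean_fragments` and the abbreviated `sample` list are illustrative and not part of the original answer.

```python
import re


def clean_fragments(fragments):
    """Collapse runs of whitespace in each fragment and drop empty ones."""
    cleaned = (re.sub(r"\s+", " ", fragment).strip() for fragment in fragments)
    return [fragment for fragment in cleaned if fragment]


# Abbreviated sample taken from the scrapy shell output above.
sample = [
    "\r\n\r\n",
    "MINISTER OF STATE(I/C) FOR COAL, POWER AND NEW & RENEWABLE ENERGY   (SHRI PIYUSH GOYAL)\r\n\r\n ",
    "\r\n\r\n\r\n",
]

result = clean_fragments(sample)
print(result)
```

Note that collapsing whitespace also flattens the tab-aligned table in the answer text, so this is only suitable if that layout does not need to be preserved.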

Answer 1: (score: 0)

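At least part of the difficulty is that the code you see in the console is not the source HTML that your spider receives as the response (and on which the selectors run). In particular, it is extremely common for a <table> in the source not to contain a <tbody>. When your browser converts the HTML into a DOM tree, however, it inserts the <tbody> tags. There was a time when much of the layout of web pages was done with (crazily) nested tables, so the DOM of such sites typically contains many more <tbody> elements than the HTML source does.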
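The effect can be reproduced without a browser. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selector, on the assumption that both evaluate paths against the markup as delivered and do not add `<tbody>` themselves. The markup in `SOURCE_HTML` is a hypothetical, simplified fragment, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: like the raw page source, it has no <tbody>.
SOURCE_HTML = """
<div id="grid">
  <table>
    <tr><td>QUESTION</td></tr>
    <tr><td>question text</td></tr>
  </table>
</div>
"""

root = ET.fromstring(SOURCE_HTML)

# Path in the style copied from browser developer tools (includes tbody).
with_tbody = [td.text for td in root.findall("./table/tbody/tr/td")]

# The same path with the tbody step removed.
without_tbody = [td.text for td in root.findall("./table/tr/td")]

print(with_tbody)     # []
print(without_tbody)  # ['QUESTION', 'question text']
```

The first path matches nothing because the step it requires exists only in the browser's DOM, which is consistent with the empty result described in the question.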

In practice this means:

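  1. It is usually a good idea to find a relatively simple XPath (or CSS selector, or ...) for the elements you want to select, rather than the behemoth you sometimes get from the developer tools.
  2. Including /tbody in your XPath is usually a bad idea (unless it carries relevant attributes, which indicates that the tag exists in the source HTML).
  3. For the site in question,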

     response.xpath('//td[@class="griditemq"]').extract()
    

    returns a list whose first element is the question and whose second element is the answer.
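To illustrate the first point, here is a minimal sketch of selecting by class attribute instead of by position. It again uses `xml.etree.ElementTree` as a stand-in for Scrapy's selector. The class names `mainheaderq` and `griditemq` come from the answers above; the surrounding markup and the placeholder cell text are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified version of the nested tables on the page.
SOURCE_HTML = """
<table id="grid">
  <tr><td>
    <table>
      <tr><td class="mainheaderq"><font>QUESTION</font></td></tr>
      <tr><td class="griditemq">question <b>text</b></td></tr>
    </table>
    <table>
      <tr><td class="mainheaderq"><font>ANSWER</font></td></tr>
      <tr><td class="griditemq">answer text</td></tr>
    </table>
  </td></tr>
</table>
"""

root = ET.fromstring(SOURCE_HTML)

# One short, attribute-based path finds both cells, however deeply nested.
cells = root.findall(".//td[@class='griditemq']")

# itertext() also picks up text inside child tags such as <b>.
texts = ["".join(td.itertext()).strip() for td in cells]
question, answer = texts

print(question)  # question text
print(answer)    # answer text
```

Because the path does not spell out the chain of table, tr and td steps, it keeps working whether or not a `<tbody>` is present.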
