Python lxml not picking up tags

Date: 2016-03-17 17:04:57

Tags: python html web-scraping lxml

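Hello, I am trying to scrape CNN's primary results for this election season and do some machine learning with them. I am using Python 3.5, and after some research I saw that I could use lxml or BeautifulSoup together with requests to do this. After failing with BeautifulSoup (I tried to use XPath, but it did not pick anything up), I tried lxml. On the Iowa primary page (and for every state so far), CNN breaks the results down by county and by each candidate's percentage of the vote. After looking at the page's HTML, I saw that each county name is stored in an h2 tag that comes right after a div tag (with class attributes), and the same holds for every county. So I used CSSSelector to try to capture it (since the h2 always comes after a county's div). The relevant HTML looks like this: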

<div class="race-results__county-header race-results__county-name section-header__column" data-reactid=".0.4.3.0.0.0.0.$0.0.$0">
    <h2 class="section-heading" data-reactid=".0.4.3.0.0.0.0.$0.0.$0.0">Adair</h2>
</div>

The code is as follows:

from lxml import html
import requests

page = requests.get('http://www.cnn.com/election/primaries/counties/ia/Rep').text
doc = html.fromstring(page)
link = doc.cssselect("div h2")
print(link)

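However, when I print link, I get absolutely nothing (just an empty list, []). Is this a problem with how the HTML is laid out, with the code, or with the parser? I am using JetBrains' PyCharm, but I don't think that has anything to do with it. I am very new to this, so any other approach would be much appreciated.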

1 answer:

Answer 0 (score: 0)

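The problem is that the page does not contain the results you expect, probably because they are rendered by JavaScript.

When I downloaded the content from the given URL, there were no <h2> elements, but I did find this message: Please enable JavaScript to view the CNN 2016 Election Center.

You are not getting the data because it is not in the page.

Do not be confused by the fact that your browser may show you <h2> elements: JavaScript put them there.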
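A quick way to confirm this is to count the <h2> tags in the HTML that was actually downloaded, before blaming the selector. The following is only a minimal sketch: it uses the standard library's html.parser so that it runs without network access, and the two HTML strings are hypothetical stand-ins (one for the raw response, one for what the browser shows after JavaScript has run), not real CNN output. The helper names TagCounter and count_tag are made up for this illustration.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each start tag occurs in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html_text, tag):
    """Return the number of <tag> start tags found in html_text."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts.get(tag, 0)


# Hypothetical stand-in for the raw response: only a placeholder and a notice.
raw_html = (
    "<html><body>"
    "<noscript>Please enable JavaScript to view the CNN 2016 Election Center."
    "</noscript><div id='app'></div>"
    "</body></html>"
)

# Hypothetical stand-in for the DOM after the browser has run the scripts.
rendered_html = (
    '<div class="race-results__county-name">'
    '<h2 class="section-heading">Adair</h2>'
    "</div>"
)

print(count_tag(raw_html, "h2"))       # nothing for a selector to match
print(count_tag(rendered_html, "h2"))  # the element seen in the browser
```

With the real page, passing requests.get(url).text to count_tag would show whether the selector ever had anything to match.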

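Tip: check which JSON files the page loads. Most likely, some of them provide ready-to-use data for your task. Using F12 in my web browser (and refreshing the page afterwards), I saw many JSON files, some of which provide data about the candidates.

For example, the URL http://data.cnn.com/ELECTION/2016primary/candidates/can1187.json returns the following (abbreviated):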

{
  "candidateInfo": {
    "id": 1187,
    "fname": "Mike",
    "lname": "Huckabee",
    "party": "Rep",
    "rd": "1",
    "pd": "0",
    "td": "1",
    "d_nom": 1237,
    "inrace": true,
    "nominee": false,
    "rd_k": "1460",
    "td_k": 2472,
    "dpct": 0,
    "dpct_nom": 50,
    "states": [
      {
        "state": "Alabama",
        "code": "AL",
        "electiondate": "20160301",
        "primarytype": "primary",
        "candidates": []
      },
      {
        "state": "Alaska",
        "code": "AK",
        "electiondate": "20160301",
        "primarytype": "caucus",
        "candidates": []
      },
      {
        "state": "Arizona",
        "code": "AZ",
        "electiondate": "",
        "primarytype": "",
        "candidates": []
      },
      {
        "state": "Arkansas",
        "code": "AR",
        "electiondate": "20160301",
        "primarytype": "primary",
        "candidates": []
      },
      {
        "state": "Iowa",
        "code": "IA",
        "electiondate": "20160201",
        "primarytype": "caucus",
        "candidates": [
          {
            "id": 1187,
            "rd": "1",
            "pd": "0",
            "td": "1",
            "winner": false
          }
        ]
      },
      {
        "state": "Kansas",
        "code": "KS",
        "electiondate": "20160305",
        "primarytype": "caucus",
        "candidates": []
      },
      {
        "state": "Kentucky",
        "code": "KY",
        "electiondate": "20160305",
        "primarytype": "caucus",
        "candidates": []
      },
      {
        "state": "Louisiana",
        "code": "LA",
        "electiondate": "20160305",
        "primarytype": "primary",
        "candidates": []
      },
      {
        "state": "Maine",
        "code": "ME",
        "electiondate": "20160305",
        "primarytype": "caucus",
        "candidates": []
      },
      {
        "state": "Maryland",
        "code": "MD",
        "electiondate": "",
        "primarytype": "",
        "candidates": []
      },
      {
        "state": "Massachusetts",
        "code": "MA",
        "electiondate": "20160301",
        "primarytype": "primary",
        "candidates": []
      },
      {
        "state": "Michigan",
        "code": "MI",
        "electiondate": "20160308",
        "primarytype": "primary",
        "candidates": []
      },
      {
        "state": "Minnesota",
        "code": "MN",
        "electiondate": "20160301",
        "primarytype": "caucus",
        "candidates": []
      },
      {
        "state": "Mississippi",
        "code": "MS",
        "electiondate": "20160308",
        "primarytype": "primary",
        "candidates": []
      },
      {
        "state": "Missouri",
        "code": "MO",
        "electiondate": "20160315",
        "primarytype": "primary",
        "candidates": []
      },
      {
        "state": "Montana",
        "code": "MT",
        "electiondate": "",
        "primarytype": "",
        "candidates": []
      },
      {
        "state": "Nebraska",
        "code": "NE",
        "electiondate": "",
        "primarytype": "",
        "candidates": []
      },
      {
        "state": "Nevada",
        "code": "NV",
        "electiondate": "20160223",
        "primarytype": "caucus",
        "candidates": []
      },
      {
        "state": "New Hampshire",
        "code": "NH",
        "electiondate": "20160209",
        "primarytype": "primary",
        "candidates": []
      },
      {
        "state": "New Jersey",
        "code": "NJ",
        "electiondate": "",
        "primarytype": "",
        "candidates": []
      },
      {
        "state": "New Mexico",
        "code": "NM",
        "electiondate": "",
        "primarytype": "",
        "candidates": []
      },
      {
        "state": "New York",
        "code": "NY",
        "electiondate": "",
        "primarytype": "",
        "candidates": []
      },
      {
        "state": "North Carolina",
        "code": "NC",
        "electiondate": "20160315",
        "primarytype": "primary",
        "candidates": []
      },
      {
        "state": "North Dakota",
        "code": "ND",
        "electiondate": "",
        "primarytype": "",
        "candidates": []
      },
      {
        "state": "Ohio",
        "code": "OH",
        "electiondate": "20160315",
        "primarytype": "primary",
        "candidates": []
      },
      {
        "state": "Oklahoma",
        "code": "OK",
        "electiondate": "20160301",
        "primarytype": "primary",
        "candidates": []
      },
      {
        "state": "Oregon",
        "code": "OR",
        "electiondate": "",
        "primarytype": "",
        "candidates": []
      },
      {
        "state": "Virgin Islands",
        "code": "VI",
        "electiondate": "",
        "primarytype": "",
        "candidates": []
      },
      {
        "state": "Northern Marianas",
        "code": "MP",
        "electiondate": "",
        "primarytype": "",
        "candidates": []
      }
    ],
    "races": [
      {
        "status": "called",
        "code": "AR",
        "state": "Arkansas",
        "polltype": "exit",
        "primarytype": "primary",
        "cresults": true,
        "cmap": true,
        "xpoll": true,
        "electiondate": "20160301",
        "pctsrep": 100,
        "ts": 1457130949809,
        "racerank": 6,
        "winner": false,
        "vpct": 1,
        "pctDecimal": "1.2",
        "inc": false,
        "votes": 4703,
        "cvotes": "4,703",
        "rd": "0",
        "pd": "0",
        "sd": "0",
        "td": "0",
        "position": 13
      },
      {
        "status": "called",
        "code": "GA",
        "state": "Georgia",
        "polltype": "exit",
        "primarytype": "primary",
        "cresults": true,
        "cmap": true,
        "xpoll": true,
        "electiondate": "20160301",
        "pctsrep": 92,
        "ts": 1457130978961,
        "racerank": 8,
        "winner": false,
        "vpct": 0,
        "pctDecimal": "0.2",
        "inc": false,
        "votes": 2615,
        "cvotes": "2,615",
        "rd": "0",
        "pd": "0",
        "sd": "0",
        "td": "0",
        "position": 13
      },
      {
        "status": "called",
        "code": "TN",
        "state": "Tennessee",
        "polltype": "exit",
        "primarytype": "primary",
        "cresults": true,
        "cmap": true,
        "xpoll": true,
        "electiondate": "20160301",
        "pctsrep": 100,
        "ts": 1457131086792,
        "racerank": 7,
        "winner": false,
        "vpct": 0,
        "pctDecimal": "0.3",
        "inc": false,
        "votes": 2404,
        "cvotes": "2,404",
        "rd": "0",
        "pd": "0",
        "sd": "0",
        "td": "0",
        "position": 15
      },
      {
        "status": "called",
        "code": "IA",
        "state": "Iowa",
        "polltype": "entrance",
        "primarytype": "caucus",
        "cresults": true,
        "cmap": true,
        "xpoll": true,
        "electiondate": "20160201",
        "pctsrep": 99,
        "ts": 1454997428611,
        "racerank": 9,
        "winner": false,
        "vpct": 2,
        "pctDecimal": "1.8",
        "inc": false,
        "votes": 3345,
        "cvotes": "3,345",
        "rd": "1",
        "pd": "0",
        "sd": "1",
        "td": "1",
        "position": 14
      },
      {
        "status": "called",
        "code": "AL",
        "state": "Alabama",
        "polltype": "exit",
        "primarytype": "primary",
        "cresults": true,
        "cmap": true,
        "xpoll": true,
        "electiondate": "20160301",
        "pctsrep": 100,
        "ts": 1456958822650,
        "racerank": 8,
        "winner": false,
        "vpct": 0,
        "pctDecimal": "0.3",
        "inc": false,
        "votes": 2535,
        "cvotes": "2,535",
        "rd": "0",
        "pd": "0",
        "sd": "0",
        "td": "0",
        "position": 13
      }
    ],
    "lts": 1458233488340
  }
}
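Once you have such a JSON document, no HTML parsing is needed at all. As a rough sketch, the snippet below loads a heavily trimmed copy of the response above (only a few of the fields shown, for two of the races) and turns the "races" list into plain rows. The function name summarize_races and the choice of columns are assumptions made for illustration; the field names come from the response shown above.

```python
import json

# Trimmed copy of the response shown above (two races, a few fields each).
sample = """
{
  "candidateInfo": {
    "id": 1187,
    "fname": "Mike",
    "lname": "Huckabee",
    "races": [
      {"code": "AR", "state": "Arkansas", "votes": 4703, "pctDecimal": "1.2"},
      {"code": "IA", "state": "Iowa", "votes": 3345, "pctDecimal": "1.8"}
    ]
  }
}
"""


def summarize_races(payload):
    """Return one (candidate, state code, votes, percent) tuple per race."""
    info = payload["candidateInfo"]
    name = "{} {}".format(info["fname"], info["lname"])
    return [
        # pctDecimal arrives as a string, so convert it to a number.
        (name, race["code"], race["votes"], float(race["pctDecimal"]))
        for race in info["races"]
    ]


rows = summarize_races(json.loads(sample))
for row in rows:
    print(row)
```

To work on live data instead of the sample, requests.get(url).json() returns the same kind of dictionary, assuming the endpoint is still being served.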