Using XPath to get a div's text, including link text

Time: 2017-12-14 22:14:19

Tags: html xpath

What XPath selector retrieves the full text of a Tweet div, including link text, as a single return value?

  <div class="lead_table">
      <table id="lead_table" style="width:100%">
  <tr>
    <th width="3%" id="i_d">ID</th>
    <th width="35%" id="assessment">Assessment</th>
    <th width="17%" id="risk_scale">Risk Scale<br>(1=Low Risk; 5=High Risk)</th>
    <th width="5%" id="score">Score</th>
    <th width="35%" id="notes">Explanation/Notes/Proposed Action</th>
  </tr>

  <tr>
    <td align="center">L1</td>
    <td>Question 1</td>
    <td align="center">Scale 1</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L2</td>
    <td>Question 2</td>
    <td align="center">Scale 2</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L3</td>
    <td>Question 3</td>
    <td align="center">Scale 3</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L4</td>
    <td>Question 4</td>
    <td align="center">Scale 4</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L5</td>
    <td>Question 5</td>
    <td align="center">Scale 5</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L6</td>
    <td>Question 6</td>
    <td align="center">Scale 6</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L7</td>
    <td>Question 7</td>
    <td align="center">Scale 7</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L8</td>
    <td>Question 8</td>
    <td align="center">Scale 8</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L9</td>
    <td>Question 9</td>
    <td align="center">Scale 9</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L10</td>
    <td>Question 10</td>
    <td align="center">Scale 10</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L11</td>
    <td>Question 11</td>
    <td align="center">Scale 11</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L12</td>
    <td>Question 12</td>
    <td align="center">Scale 12</td>
    <td id="ans"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr>
    <td align="center">L13</td>
    <td>Question 13</td>
    <td align="center">Scale 13</td>
    <td id="ans13"><input type="number" min="1" max="5" style="text-align: center;"></td>
    <td><input type="text"></td>
  </tr>

  <tr style="background-color:#4f81bd;">
    <td colspan="3" align="left" style="color:white"><strong>Results</strong></td>
    <td colspan="2" style="color:white" id="lead_res_num"><strong></strong></td>
  </tr>

  </table>
    </div>

The above works for divs without links, but when the tweet contains a link, it returns only the first string segment.

1 answer:

Answer 0 (score: 0)

  

"The above works for divs without links, but when the tweet contains a link, it returns only the first string segment."

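That is because of the /text() part: it matches only the top-level text child nodes of the element. To match all text nodes inside an element, at any level, you can do the following: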

//*[contains(@class, 'tweet-text')][2]//text()

This is what HTML parsers usually do when asked for the "text" value of a node: they recurse into all child nodes, collect their "text" values, and then join them.
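The recursive collect-and-join behaviour can be sketched by hand. This is only an illustration using the standard library's xml.etree.ElementTree rather than lxml; the helper name collect_text and the sample markup are assumptions made for the example, not part of any parser's API.

```python
import xml.etree.ElementTree as ET


def collect_text(node):
    """Gather the text of a node and all of its descendants, in document order."""
    parts = [node.text or ""]              # text before the first child
    for child in node:
        parts.append(collect_text(child))  # everything inside the child
        parts.append(child.tail or "")     # text between this child and the next
    return "".join(parts)


# Hypothetical sample markup: a div with text on both sides of a link.
div = ET.fromstring(
    '<div>div text here <a href="https://google.com">link text</a> and more</div>'
)

direct_only = div.text         # only the text before the first child element
full_text = collect_text(div)  # text at every level, joined

print(repr(direct_only))
print(repr(full_text))
```

ElementTree already provides this traversal as itertext(), so "".join(div.itertext()) should give the same string as collect_text(div).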

Demonstrating all of the above with Python and the lxml parser:

In [1]: from lxml.html import fromstring 

In [2]: html = """
    ...: <div>
    ...:     div text here
    ...:     <a href="https://google.com">link text</a>
    ...: </div>"""

In [3]: root = fromstring(html)

In [4]: root.xpath('//div/text()')  # <- No text of the a element
Out[4]: ['\n    div text here\n    ', '\n']

In [5]: root.xpath('//div//text()')  # <- We've got all the texts now
Out[5]: ['\n    div text here\n    ', 'link text', '\n']

In [6]: root.xpath("//div")[0].text_content()  # <- but this would do that for us
Out[6]: '\n    div text here\n    link text\n'
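Since the question asks for a single return value, the list returned by //text() still has to be joined. As a sketch, reusing the sample markup from the session above, the join can be done in Python or pushed into the expression itself with the XPath 1.0 functions string() and normalize-space(). The variable names here are only illustrative.

```python
from lxml.html import fromstring

html = """
<div>
    div text here
    <a href="https://google.com">link text</a>
</div>"""

root = fromstring(html)

# Join the individual text nodes on the Python side.
joined = "".join(root.xpath("//div//text()"))

# Let XPath compute the string value of the first matching div.
as_string = root.xpath("string(//div)")

# Same, but with leading/trailing whitespace stripped and inner runs collapsed.
normalized = root.xpath("normalize-space(//div)")

print(repr(joined))
print(repr(as_string))
print(repr(normalized))
```

Note that string() and normalize-space() take the first node of the node-set they are given, so for a page with several matching elements the expression still needs a positional predicate such as the [2] used earlier.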