Extracting an element from an HTML snippet

Time: 2019-05-28 23:02:31

Tags: python python-3.x scrapy

I'm using Scrapy selectors and trying to extract the text "1" from the HTML element below:

<li aria-label="Pagina" class="page active"><a href="#">1</a></li>

The full HTML source contains two identical copies of this pagination block:


<div class="row paging-bar">
    <ul class="sync-pagination pagination pull-right">
       <li aria-label="Pagina" class="prev"><a href="#">&lt;</a></li>
       <li aria-label="Pagina" class="page active"><a href="#">1</a></li>
       <li aria-label="Pagina" class="page"><a href="#">2</a></li>
       <li aria-label="Pagina" class="page"><a href="#">3</a></li>
       <li aria-label="Pagina" class="page"><a href="#">4</a></li>
       <li aria-label="Pagina" class="page"><a href="#">5</a></li>
       <li aria-label="Pagina" class="page"><a href="#">6</a></li>
       <li><span>...</span></li>
       <li aria-label="Pagina" class="page"><a href="#">1405</a></li>
       <li aria-label="Pagina" class="next"><a href="#">&gt;</a></li>
    </ul>
</div>

<div class="row paging-bar">
    <ul class="sync-pagination pagination pull-right">
       <li aria-label="Pagina" class="prev"><a href="#">&lt;</a></li>
       <li aria-label="Pagina" class="page active"><a href="#">1</a></li>
       <li aria-label="Pagina" class="page"><a href="#">2</a></li>
       <li aria-label="Pagina" class="page"><a href="#">3</a></li>
       <li aria-label="Pagina" class="page"><a href="#">4</a></li>
       <li aria-label="Pagina" class="page"><a href="#">5</a></li>
       <li aria-label="Pagina" class="page"><a href="#">6</a></li>
       <li><span>...</span></li>
       <li aria-label="Pagina" class="page"><a href="#">1405</a></li>
       <li aria-label="Pagina" class="next"><a href="#">&gt;</a></li>
    </ul>
</div></div>

I tried the following command:

response.xpath("normalize-space(//li[@class='page active']/a[@href]/text())").extract_first()

But it returns an empty string.

1 answer:

Answer 0: (score: 0)

It works for me with the HTML from your question, both without and with the extra closing </div>:

>>> html = """
... <div class="row paging-bar">
...     <ul class="sync-pagination pagination pull-right">
...        <li aria-label="Pagina" class="prev"><a href="#">&lt;</a></li>
...        <li aria-label="Pagina" class="page active"><a href="#">1</a></li>
...        <li aria-label="Pagina" class="page"><a href="#">2</a></li>
...        <li aria-label="Pagina" class="page"><a href="#">3</a></li>
...        <li aria-label="Pagina" class="page"><a href="#">4</a></li>
...        <li aria-label="Pagina" class="page"><a href="#">5</a></li>
...        <li aria-label="Pagina" class="page"><a href="#">6</a></li>
...        <li><span>...</span></li>
...        <li aria-label="Pagina" class="page"><a href="#">1405</a></li>
...       <li aria-label="Pagina" class="next"><a href="#">&gt;</a></li>
...     </ul>
... </div>
... """
>>> from parsel import Selector
>>> selector = Selector(text=html)
>>> selector.xpath("normalize-space(//li[@class='page active']/a[@href]/text())").extract_first()
'1'
>>> html = """
... <div class="row paging-bar">
...     <ul class="sync-pagination pagination pull-right">
...        <li aria-label="Pagina" class="prev"><a href="#">&lt;</a></li>
...        <li aria-label="Pagina" class="page active"><a href="#">1</a></li>
...        <li aria-label="Pagina" class="page"><a href="#">2</a></li>
...        <li aria-label="Pagina" class="page"><a href="#">3</a></li>
...        <li aria-label="Pagina" class="page"><a href="#">4</a></li>
...        <li aria-label="Pagina" class="page"><a href="#">5</a></li>
...        <li aria-label="Pagina" class="page"><a href="#">6</a></li>
...        <li><span>...</span></li>
...        <li aria-label="Pagina" class="page"><a href="#">1405</a></li>
...        <li aria-label="Pagina" class="next"><a href="#">&gt;</a></li>
...     </ul>
... </div></div>
... """
>>> selector = Selector(text=html)
>>> selector.xpath("normalize-space(//li[@class='page active']/a[@href]/text())").extract_first()
'1'