How do I get the text of a <p> tag at an unknown nth-child position using Scrapy?

Time: 2020-04-29 11:25:47

Tags: python scrapy web-crawler

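I am trying to get the description of each event. The problem is that the description sits in a <p> tag at an arbitrary position within each event. How can I select that <p> tag and get its text?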

<div id='main'>
   <div class='templatecontent'>
       <h3>Evening Tide Talk-POSTPONED<img alt="" src="https://assets.speakcdn.com/assets/2204/hj_scope-2020022008493216.jpg" style="margin: 4px 14px; float: right; width: 300px; height: 463px;" /></h3>

       <p><strong>March 25th | 5:45 p.m. </strong></p>

       <p><strong>Dr. Heather Judkins</strong></p>

       <p><strong>University of South Florida St. Petersburg, Department of Biological Sciences</strong></p>

       <p><strong><em>Lessons Learned from Exploring the Deep</em></strong></p>
       <!-- I want to get this paragraph -->

       <p>In her talk, Heather will share lessons learned and some unexpected finds from her journeys. Join us as she discusses unique cephalopod adaptations and memorable moments, while also sharing some “giant” findings from her most recent Gulf of Mexico cruise that led to breaking news in June 2019’s New York Times!</p>

       <p><a class="button-primary" href="/eveningtidetalks">Learn More</a></p>

       <p> </p>

       <p> </p>

       <p> </p>

       <hr />
       <h3>Washed Ashore - Art To Save The Sea <img alt="" src="https://assets.speakcdn.com/assets/2204/tfa_washed_ashore_exhibit_priscilla2.png" style="margin: 3px 13px; float: right; width: 300px; height: 300px;" /></h3>

       <p><strong><strong>February 29th - August 31st</strong></strong></p>

       <!-- I want to get this paragraph -->
       <p>In honor of the Aquarium's 25th Anniversary celebration, we are proud to host Washed Ashore - Art To Save The Sea from now until the end of August! The nationally acclaimed exhibit artistically showcases the impacts of plastic pollution on oceans, waterways and wildlife. Washed Ashore sculptures have traveled around the country and The Florida Aquarium is showcasing 18 larger than life sculptures of marine life. </p>

       <p><a class="button" href="/washed-ashore">Learn More</a></p>

       <p> </p>

       <hr />
   </div>
</div>

As you can see here.

1 answer:

Answer 0 (score: 0)

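You need the following-sibling axis to select the <p> tags that sit at the same level as the <h3>, and then restrict the matching <p> tags to those that have a text() node as a direct child. With just p[text()], though, you would also get back the (more or less) useless empty <p> </p> tags. So use string-length to narrow the match further, so that it returns only content that looks "interesting", which gives: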

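def parse(self, response):
    main_div = response.css('#main')
    for h3 in main_div.xpath('.//h3'):
        # Text directly inside the <h3> is the event title.
        talk_title = h3.xpath('text()').get()
        # First following <p> sibling whose own text is longer than 2 characters.
        talk_summary = h3.xpath('./following-sibling::p[string-length(text()) > 2]/text()').get()
        yield {'talk_title': talk_title, 'talk_summary': talk_summary}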

This produces:

[
  {
    "talk_title": "Evening Tide Talk-POSTPONED",
    "talk_summary": "In her talk, Heather will share lessons learned and some unexpected finds from her journeys. Join us as she discusses unique cephalopod adaptations and memorable moments, while also sharing some “giant” findings from her most recent Gulf of Mexico cruise that led to breaking news in June 2019’s New York Times!"
  },
  {
    "talk_title": "Washed Ashore - Art To Save The Sea ",
    "talk_summary": "In honor of the Aquarium's 25th Anniversary celebration, we are proud to host Washed Ashore - Art To Save The Sea from now until the end of August! The nationally acclaimed exhibit artistically showcases the impacts of plastic pollution on oceans, waterways and wildlife. Washed Ashore sculptures have traveled around the country and The Florida Aquarium is showcasing 18 larger than life sculptures of marine life. "
  }
]

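The following-sibling::p axis matches all <p> elements at the same level in the DOM as the element the XPath is anchored to (in this case the <h3>), which produces a list of <p> tags: 9 between the first <h3> and its <hr />, plus the later ones, since the axis does not stop at the next <h3>. The p[...] XPath syntax restricts the matched <p> tags to those that satisfy a predicate, and string-length(text()) > 2 says that the string length of the first immediate text child node must be greater than 2. Finally, /text() selects the text child nodes of the matching <p> tags, and .get() returns the first one.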
