How can I extract two levels of text with Scrapy?

Time: 2016-08-04 00:26:19

Tags: python scrapy

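My code does not work as expected.

The second for loop does not get all of the text.

How can I do this in Scrapy?

Thanks for any hints, and please let me know if I have left anything out.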

<dl>
<dt>Release Date:</dt>
<dd>Aug. 01, 2016<br>
</dd>

<dt>Runtime:</dt>
<dd itemprop="duration">200min.<br></dd>

<dt>Languages:</dt>
<dd>Japanese<br></dd>
<dt>Subtitles:</dt>
<dd>----<br></dd>
<dt>Content ID:</dt>
<dd>8dtkm00045<br></dd>
<dt>Actress(es):</dt>
<dd itemprop="actors">
    <span itemscope="" itemtype="http://schema.org/Person">
        <a itemprop="name">Shinobu Oshima</a>
    </span>

    <span itemscope="" itemtype="http://schema.org/Person">
        <a itemprop="name">Yukie Mizukami</a>
    </span>

</dd>
</dl>

Spider:

def parse_item(self, response):
    for sel in response.xpath('//*[@id="contents"]/div[10]/section/section[1]/section[1]'):
        item = EnMovie()
        Content_ID = sel.xpath('normalize-space(div[2]/dl/dt[contains (.,"Content ID:")]/following-sibling::dd[1]/text())').extract()
        item['Content_ID'] = Content_ID[0].encode('utf-8')
        release_date = sel.xpath('normalize-space(div[2]/dl[1]/dt[contains (.,"Release Date:")]/following-sibling::dd[1]/text())').extract()
        item['release_date'] = release_date[0].encode('utf-8')
        running_time = sel.xpath('normalize-space(div[2]/dl[1]/dt[contains (.,"Runtime:")]/following-sibling::dd[1]/text())').extract()
        item['running_time'] = running_time[0].encode('utf-8')
        Series = sel.xpath('normalize-space(div[2]/dl[2]/dt[contains (.,"Series:")]/following-sibling::dd[1]/text())').extract()
        item['Series'] = Series[0].encode('utf-8')
        Studio = sel.xpath('normalize-space(div[2]/dl[2]/dt[contains (.,"Studio:")]/following-sibling::dd[1]/a/text())').extract()
        item['Studio'] = Studio[0].encode('utf-8')
        Director = sel.xpath('normalize-space(div[2]/dl[2]/dt[contains (.,"Director:")]/following-sibling::dd[1]/text())').extract()
        item['Director'] = Director[0].encode('utf-8')
        Label = sel.xpath('normalize-space(div[2]/dl[2]/dt[contains (.,"Label:")]/following-sibling::dd[1]/text())').extract()
        item['Label'] = Label[0].encode('utf-8')
        item['image_urls'] = sel.xpath('div[1]/img/@src').extract()


        for actress in sel.xpath("//*[@itemprop='actors']//*[@itemprop='name']"):
            actress_ = actress.xpath("text()").extract()
            item['Actress'] = actress_[0].strip()
            yield item

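Most of the spider works well (except for the second for loop). The second for loop only yields the last [itemprop="name"] value, and only that value is saved to the DB.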
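One likely explanation for this symptom is that the loop mutates and yields the same item object on every pass, so anything that holds on to the yielded items sees only the final value. The sketch below reproduces that effect in plain Python. It is an illustration under assumptions, not a trace of the actual spider: plain dicts stand in for `EnMovie`, and the function names `yield_shared` and `yield_copies` are hypothetical. Whether the real pipeline behaves exactly this way depends on how it consumes the items.

```python
import copy


def yield_shared(names):
    """Mimic the loop in the question: one item is mutated and re-yielded."""
    item = {"Content_ID": "8dtkm00045"}  # a dict standing in for EnMovie()
    for name in names:
        item["Actress"] = name
        yield item  # every yield hands out the very same object


def yield_copies(names):
    """Yield an independent copy per actress so earlier values survive."""
    item = {"Content_ID": "8dtkm00045"}
    for name in names:
        row = copy.copy(item)  # shallow copy is enough for flat string fields
        row["Actress"] = name
        yield row


names = ["Shinobu Oshima", "Yukie Mizukami"]
shared = list(yield_shared(names))
copies = list(yield_copies(names))

print([i["Actress"] for i in shared])  # ['Yukie Mizukami', 'Yukie Mizukami']
print([i["Actress"] for i in copies])  # ['Shinobu Oshima', 'Yukie Mizukami']
```

If one item per actress is what you want, copying the item before setting the name is one way to keep the values apart.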

Sorry for my poor English, and thanks for any hints.

1 Answer:

Answer 0: (score: 0)

Replace your second loop with this:

actresses = sel.xpath("//*[@itemprop='actors']//*[@itemprop='name']/text()").extract()

item['Actress'] = [x.strip() for x in actresses]

yield item

This gives you a single item that holds a list of actresses.

BTW, please stop posting the same question again and again.