Following the scraped links

Date: 2018-05-18 19:09:20

Tags: python web-scraping scrapy web-crawler

I have the following spider:

import scrapy
from final.items import FinalItem

class ScrapeMovies(scrapy.Spider):
    name='final'

    start_urls = [
        'https://www.trekearth.com/members/page1.htm?sort_by=md'
    ]

    def parse(self, response):
        for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):

            item = FinalItem()

            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            website = row.xpath('./td[2]//a/@href/text()').extract_first()
            request = scrapy.Request(website,
            callback=self.parse_page2)
            yield request

    def parse_page2(self, response):
            request.meta['item'] = item
            item['travelog'] = response.xpath('string(//div[@class="statistics-btm"]/ul//li[position()=4]/a)').extract_first()
            yield item

#       next_page=response.xpath('//div[@class="page-nav-btm"]/ul/li[last()]/a/@href').extract_first()
#       if next_page is not None:
#            next_page=response.urljoin(next_page)
#            yield scrapy.Request(next_page, callback=self.parse)

I have a table. I want to scrape the names (and some other information) from this table, then follow the link to each user's profile, collect some data from those profiles, and combine it all into a single item.

Then I want to go back to the main table and move on to its next page, and so on until the end (the last part of the code is responsible for this; it is commented out for convenience).

The code I wrote doesn't work properly. The error I get is:

TypeError: Request url must be str or unicode, got NoneType:

How can I fix this? How can I make it scrape all the data correctly?
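For context (my reading of the traceback, not part of the original post): `./td[2]//a/@href/text()` asks for the text children of an attribute node, which do not exist, so `extract_first()` returns `None`, and `scrapy.Request(None, ...)` raises the `TypeError`. Even with the corrected expression `./td[2]//a/@href`, the href on this site is relative and must be joined against the page URL before being requested. `response.urljoin(...)` does this with the same semantics as the standard library's `urljoin` (the href value below is a hypothetical example):

```python
from urllib.parse import urljoin

base = "https://www.trekearth.com/members/page1.htm?sort_by=md"
relative_href = "/members/someuser/"  # hypothetical profile href from the page

# Join the relative href against the page URL; the absolute path replaces
# the base's path and query, keeping scheme and host.
absolute = urljoin(base, relative_href)
print(absolute)  # https://www.trekearth.com/members/someuser/
```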

1 answer:

Answer 0 (score: 1):

You need this code (your XPath expression was wrong):

def parse(self, response):
    for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):

        item = FinalItem()

        item['name'] = row.xpath('./td[2]//a/text()').extract_first()
        profile_url = row.xpath('./td[2]//a/@href').extract_first()
        yield scrapy.Request( url=response.urljoin(profile_url), callback=self.parse_profile, meta={"item": item } )

    next_page_url = response.xpath('//div[@class="page-nav-btm"]//li[last()]/a/@href').extract_first()
    if next_page_url:
        yield scrapy.Request( url=response.urljoin(next_page_url), callback=self.parse )

def parse_profile(self, response):
    item = response.meta['item']
    item['travelog'] = response.xpath('//div[@class="statistics-btm"]/ul//li[ ./span[contains(., "Travelogues")] ]/a/text()').extract_first()
    yield item
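The key pattern in the answer is carrying the partially filled item through `request.meta`: `parse` builds the item, attaches it to the request, and `parse_profile` retrieves and completes it. Stripped of Scrapy, the hand-off looks like this (a minimal sketch with plain dicts and hypothetical stand-in functions, not the Scrapy API):

```python
def parse_row(name, profile_url):
    """First callback: build a partial item and attach it to the request."""
    item = {"name": name}
    # In Scrapy this would be:
    #   scrapy.Request(url, callback=self.parse_profile, meta={"item": item})
    return {"url": profile_url, "meta": {"item": item}}

def parse_profile(request, travelog):
    """Second callback: recover the partial item from meta and complete it."""
    item = request["meta"]["item"]
    item["travelog"] = travelog
    return item

request = parse_row("alice", "/members/alice/")
item = parse_profile(request, "12 travelogues")
print(item)  # {'name': 'alice', 'travelog': '12 travelogues'}
```

One item is yielded only after the second callback has run, which is why `parse` yields requests rather than items.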