Question

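I have a question. When I use a CSS selector, extract() returns the result as a list, so if the selector matches nothing, the list is empty and indexing it
raises an error in the terminal (shown below), and the spider does not write any items to my JSON file.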

    item['intro'] = intro[0]
    exceptions.IndexError: list index out of range

So I use try/except to check whether the list has any elements:

    sel = Selector(response)
    sites = sel.css("div.con ul > li")
    for site in sites:
        item = Shopping_appleItem()
        links = site.css("  a::attr(href)").extract()
        title = site.css("  a::text").extract()
        date = site.css(" time::text").extract()

        try:
            item['link']  = urlparse.urljoin(response.url,links[0])
        except:
            print "link not found" 
        try:
            item['title'] = title[0]       
        except:
            print "title not found" 
        try:
            item['date'] = date[0]       
        except:
            print "date not found"

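I feel like I am using a lot of try/except blocks, and I don't know whether this is a good approach. Please give me some guidance. Thanks.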

Answer 1

You can use a separate function to extract the data. For example, for text nodes, here is some sample code:

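    def extract_text(self, node):
        # An empty selector list is falsy, so a missing node yields ''.
        if not node:
            return ''
        _text = './/text()'
        # Collect all descendant text nodes, dropping whitespace-only ones.
        extracted_list = [x.strip() for x in node.xpath(_text).extract() if len(x.strip()) > 0]
        if not extracted_list:
            return ''
        return ' '.join(extracted_list)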

You can call this method like this:

    self.extract_text(sel.css("your_path"))
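The same idea can be applied to the fields in the question. The following is a minimal sketch, not Scrapy's own API: `first_or_default` is a hypothetical helper name, and the plain lists below are stand-ins for what `.extract()` returns. It avoids both the `IndexError` and the repeated try/except blocks. Newer Scrapy versions also provide `extract_first()` on selector lists, which returns `None` (or a given `default`) when nothing matches; check that your installed version has it before relying on it.

```python
def first_or_default(values, default=None):
    """Return the first element of `values`, or `default` if it is empty."""
    # Hypothetical helper: `values` is the list returned by .extract().
    return values[0] if values else default


# Plain lists stand in for the results of site.css(...).extract().
links = ["/news/1.html"]
title = []  # the selector matched nothing

item = {}
item["link"] = first_or_default(links)
item["title"] = first_or_default(title, default="")
print(item)
```

In the spider loop, each try/except would then collapse to a single line such as `item['title'] = first_or_default(title)`. Whether a missing field should become `None`, an empty string, or cause the item to be skipped is a choice for your own pipeline.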

scrapy: another way to avoid a lot of try/except
