Question

Is there a way to overwrite the said file instead of appending to it?

Example:

scrapy crawl myspider -o "/path/to/json/my.json" -t json    
scrapy crawl myspider -o "/path/to/json/my.json" -t json

This appends to the my.json file instead of overwriting it.
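For context, here is a minimal sketch of why appending is a problem for the json format in particular. As far as I know, the JSON exporter writes one array per run, so two appended runs leave two arrays back to back, which is not a valid JSON document. The sample items and the helper name is_valid_json are made up for illustration.

```python
import json

# Hypothetical output of two separate crawls, each a complete JSON array.
first_run = json.dumps([{"title": "a"}])
second_run = json.dumps([{"title": "b"}])

# What the file holds after the second run appended to the first.
appended = first_run + "\n" + second_run


def is_valid_json(text):
    """Return True if text parses as a single JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

Each run on its own parses fine; only the appended file fails.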

Answer 1

scrapy crawl myspider -t json --nolog -o - > "/path/to/json/my.json"

Answer 2

To work around this problem, I created a subclass of scrapy.extensions.feedexport.FileFeedStorage in my myproject directory.

This is my customexport.py:

"""Custom Feed Exports extension."""
import os

from scrapy.extensions.feedexport import FileFeedStorage


class CustomFileFeedStorage(FileFeedStorage):
    """
    A File Feed Storage extension that overwrites existing files.

    See: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/feedexport.py#L79
    """

    def open(self, spider):
        """Return the opened file."""
        dirname = os.path.dirname(self.path)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        # changed from 'ab' to 'wb' to truncate file when it exists
        return open(self.path, 'wb')

Then I added the following to my settings.py (see: https://doc.scrapy.org/en/1.2/topics/feed-exports.html#feed-storages-base):

FEED_STORAGES_BASE = {
    '': 'myproject.customexport.CustomFileFeedStorage',
    'file': 'myproject.customexport.CustomFileFeedStorage',
}

Now every time I write to a file, it gets overwritten because of this.
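The whole fix above hinges on the one-character change from 'ab' to 'wb'. A small standalone sketch of the difference, with no Scrapy involved (write_twice is a hypothetical helper, and the temp file only stands in for my.json):

```python
import os
import tempfile


def write_twice(mode):
    """Write b'run\\n' to a fresh temp file twice with the given mode and return its contents."""
    path = os.path.join(tempfile.mkdtemp(), "my.json")
    for _ in range(2):
        # 'ab' keeps what is already there; 'wb' truncates before writing.
        with open(path, mode) as f:
            f.write(b"run\n")
    with open(path, "rb") as f:
        return f.read()
```

With 'ab' the file ends up holding both writes; with 'wb' only the last one survives.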

Answer 3

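This is an old, well-known "problem" of Scrapy. Every time you start a crawl and do not want to keep the results of previous runs, you must delete the file. The idea behind this is that you may want to crawl different sites, or the same site at different times, and appending keeps you from accidentally losing results you have already collected. Which could be bad.

A solution is to write your own item pipeline that opens the target file with 'w' instead of 'a'.

To see how to write such a pipeline, look at the docs: {{3}} (especially for JSON exports: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline)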
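A minimal sketch of such a pipeline, assuming JSON Lines output (one item per line) rather than a single JSON array. The class name OverwriteJsonLinesPipeline and the default path items.jl are placeholders of mine, not anything Scrapy ships:

```python
import json


class OverwriteJsonLinesPipeline:
    """Write each item as one JSON line, truncating the file on every run."""

    def __init__(self, path="items.jl"):
        # Hypothetical output path; adjust for your project.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # 'w' truncates an existing file, unlike 'a' (append).
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

To enable it you would list the class in ITEM_PIPELINES in settings.py, for example under a module path such as myproject.pipelines (assumed here), and run the crawl without -o.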

Answer 4

There is a flag that allows overwriting the output file: pass the file reference with the -O option instead of -o, so you can use:

scrapy crawl myspider -O /path/to/json/my.json

More information:

$ scrapy crawl --help
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
--help, -h              show this help message and exit
-a NAME=VALUE           set spider argument (may be repeated)
--output=FILE, -o FILE  append scraped items to the end of FILE (use - for
                        stdout)
--overwrite-output=FILE, -O FILE
                        dump scraped items into FILE, overwriting any existing
                        file
--output-format=FORMAT, -t FORMAT
                        format to use for dumping items

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
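If you configure exports in settings.py rather than on the command line, the FEEDS setting accepts an overwrite key that, to my knowledge, arrived together with -O in Scrapy 2.4. A sketch reusing the path from the question, so check it against the docs for your version:

```python
# settings.py (sketch; requires a Scrapy version that supports FEEDS and "overwrite")
FEEDS = {
    "/path/to/json/my.json": {
        "format": "json",
        # True truncates the file on each run, False appends to it.
        "overwrite": True,
    },
}
```

With this in place a plain "scrapy crawl myspider" should replace the file on each run.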

Answer 5

Since the accepted answer gave me problems with invalid JSON, this might work:

find "/path/to/json/" -name "my.json" -exec rm {} \; && scrapy crawl myspider -t json -o "/path/to/json/my.json"

Answer 6

Or you can add:

import os

if "filename.json" in os.listdir('..'):
    os.remove('../filename.json')

at the beginning of your code.

Very easy.
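A slightly more defensive variant of the same idea, as a sketch: it skips the directory listing and simply ignores a missing file. The helper name remove_stale_output is hypothetical, and missing_ok needs Python 3.8 or newer.

```python
from pathlib import Path


def remove_stale_output(path):
    """Delete the previous export if it exists; do nothing otherwise."""
    # missing_ok=True suppresses FileNotFoundError when there is no old file.
    Path(path).unlink(missing_ok=True)
```

Calling remove_stale_output('../filename.json') before the crawl starts would take the place of the listdir check.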

Scrapy: overwrite JSON file instead of appending to it
