nodejs - Filtering data from huge JSON files

Time: 2016-06-02 14:31:02

Tags: node.js algorithm big-o file-processing bigdata

I have two files containing book IDs:

- current.json [~10,000 lines]    -> books saved in the system
- feed.json    [~300,000 lines]   -> the feed file, containing all books from a book store

From these 2 files I need to generate 3 files:

- not_available.json -> books that exist in current but not in feed
- to_be_updated.json -> books that exist in both current and feed
- new.json           -> books that exist only in the feed

Because the files are too large, I read them line by line; I cannot fit the data into memory.

My code, in pseudocode, is as follows:

// export to_be_updated.json and new.json
feed <- initstream(feed.json)
while(lf <- feed.nextline())
    found <- false;
    current <- initstream(current.json)
    while(lc <- current.nextline())
        if(JSON.parse(lf).id == JSON.parse(lc).id)
            found <- true
            break
    if(found) then append(lf, to_be_updated.json)
    else append(lf, new.json)

// export not_available.json
current <- initstream(current.json)
while(lc <- current.nextline())
    found <- false;
    feed <- initstream(feed.json)
    while(lf <- feed.nextline())
        if(JSON.parse(lf).id == JSON.parse(lc).id)
            found <- true
            break
    if not(found) then append(lc, not_available.json)
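
For reference, here is one way the first loop above could be written as runnable Node.js. This is a sketch, assuming a reasonably recent Node version (readline interfaces are async-iterable in Node >= 11.14) and one JSON object per line in both files; only the to_be_updated.json / new.json loop is shown, the not_available.json loop is symmetric:

```js
// Literal translation of the first pseudocode loop into runnable code.
const fs = require('fs');
const readline = require('readline');

// initstream(path): read a file one line at a time, without loading it all.
function lines(path) {
  return readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity,
  });
}

async function exportUpdatedAndNew() {
  for await (const lf of lines('feed.json')) {
    const feedId = JSON.parse(lf).id;
    let found = false;
    // Re-scan current.json for every feed line -- this is the O(n*m) part.
    for await (const lc of lines('current.json')) {
      if (JSON.parse(lc).id === feedId) {
        found = true;
        break;
      }
    }
    fs.appendFileSync(found ? 'to_be_updated.json' : 'new.json', lf + '\n');
  }
}

exportUpdatedAndNew().catch(console.error);
```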

The time complexity of this code is O(nm) with n = 10,000 and m = 300,000 (roughly 3 × 10^9 id comparisons per pass), and the space complexity is O(1) (about 500 MB), so it takes a very long time on a Core i5.

It takes about 2 hours.

I tried to put all the logic into just one nested loop, but that is not possible. I am trying to improve the complexity while keeping the files unsorted.

Do you think this is the best approach? Is there a better way?
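
For comparison, the "improve the complexity" idea could look roughly like the sketch below: only the ~10,000 ids from current.json are held in memory (not the full records), and feed.json is streamed exactly once, giving O(n + m) time. This is a sketch under the same one-JSON-object-per-line assumption, not code from the original post; write-stream backpressure handling is omitted for brevity:

```js
const fs = require('fs');
const readline = require('readline');

function lines(path) {
  return readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity,
  });
}

async function splitFiles() {
  // Pass 1: collect the ids of books already saved in the system (small file).
  const currentIds = new Set();
  for await (const lc of lines('current.json')) {
    currentIds.add(JSON.parse(lc).id);
  }

  // Pass 2: classify every feed line in a single scan of the big file.
  const toBeUpdated = fs.createWriteStream('to_be_updated.json');
  const newBooks = fs.createWriteStream('new.json');
  for await (const lf of lines('feed.json')) {
    const id = JSON.parse(lf).id;
    if (currentIds.has(id)) {
      toBeUpdated.write(lf + '\n');
      currentIds.delete(id); // ids left over were never seen in the feed
    } else {
      newBooks.write(lf + '\n');
    }
  }

  // Pass 3: ids still in the set exist only in current.json -> not available.
  const notAvailable = fs.createWriteStream('not_available.json');
  for await (const lc of lines('current.json')) {
    if (currentIds.has(JSON.parse(lc).id)) {
      notAvailable.write(lc + '\n');
    }
  }
}

splitFiles().catch(console.error);
```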

UPDATE (file format)

feed.json has the following format (sample):

{"id": "12340", "title": "A life journey", "price": "34.00"}
{"id": "12341", "title": "all over the world", "price": "42.00"}
{"id": "12342", "title": "good to remember", "price": "60.00"}
{"id": "12343", "title": "A night in Mars", "price": "14.00"}
...

0 answers:

There are no answers yet.