Question

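I have been asked to import a CSV file from a server every day and parse its headers into the corresponding fields in Mongoose.

My first idea was to run it automatically on a schedule using the cron module.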

const CronJob = require('cron').CronJob;
const fs      = require("fs");
const csv     = require("fast-csv")

new CronJob('30 2 * * *', async function() {
  await parseCSV();
  this.stop();
}, function() {
  this.start()
}, true);

Next, the code for the parseCSV() function is as follows (I have simplified some of the data):

function parseCSV() {
  let buffer = [];

  let stream = fs.createReadStream("data.csv");
  csv.fromStream(stream, {headers:
        [
              "lot", "order", "cwotdt"
        ]
  , trim:true})
  .on("data", async (data) =>{
        // Renamed from `data`: redeclaring the callback parameter is a SyntaxError
        let parsed = { "order": data.order, "lot": data.lot, "date": data.cwotdt};

        // Only add product that fulfill the following condition
        if (data.cwotdt !== "000000"){
              let product = {"order": data.order, "lot": data.lot}
              // Check whether product exist in database or not
              await db.Product.find(product, function(err, foundProduct){
                    if(foundProduct && foundProduct.length !== 0){
                          console.log("Product exists")
                    } else{
                          buffer.push(product);
                          console.log("Product not exists")
                    }    
              })
        }
  })
  .on("end", function(){
        db.Product.find({}, function(err, productAvailable){
              // Check whether database exists or not
              if(productAvailable.length !== 0){
                    // console.log("Database Exists");
                    // Add subsequent onward
                    db.Product.insertMany(buffer)
                    buffer = [];
              } else{
                    // Add first time
                    db.Product.insertMany(buffer)
                    buffer = [];
              }
        })
  });
}

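With only a few rows in the CSV file this is not a problem, but once the file reached just 2k rows I ran into trouble. The culprit is the if condition check inside the on event handler: it has to check every single row to see whether the database already contains that data.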

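The reason I do this is that new data gets added to the CSV file. If the database is empty I need to add all the data the first time; otherwise I need to look at every row and add only the new data to Mongoose.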

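The first approach I took (as in the code above) was to use async/await to make sure all the data has been read before proceeding to the end event handler. This helps, but from time to time I see (using mongoose.set("debug", true);) that some data is queried twice, and I have no idea why.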

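The second approach was not to use async/await. This has a downside: because the data has not been fully queried, execution goes straight to the end event handler and insertMany only inserts whatever data managed to get pushed into the buffer by then.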

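If I stick with the current approach it works, but the queries take 1 to 2 minutes, and that will only get worse as the database keeps growing. During those minutes of querying the event queue is blocked, so requests sent to the server time out.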

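Before this code I used stream.pause() and stream.resume(), but I could not get it to work because it just jumps straight to the end event handler. Since the end event handler runs before the on event handler finishes, the buffer is empty every time.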

I cannot remember all the links I used, but I got the basics from this one:

Import CSV Using Mongoose Schema

I have seen these threads:

Insert a large csv file, 200'000 rows+, into MongoDB in NodeJS

Can't populate big chunk of data to mongodb using Node.js

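They are similar to what I need, but a bit too complicated for me to understand what is going on. It seems they use a socket or a child process? Also, I still need to check the condition before adding to the buffer.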

Can anyone guide me on this?

Edit: await has been removed from console.log, since it is not asynchronous.

Answer 1

If you create an index on order and lot, the query should be very fast.

db.Product.createIndex( { order: 1, lot: 1 } )

Note: this is a compound index and may not be the ideal solution. See Index strategies.
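If db.Product is a Mongoose model rather than a shell collection handle, the same index can be requested through the model's underlying collection. The following is only a minimal sketch under that assumption; ensureOrderLotIndex is a hypothetical helper name, and declaring the index on the schema with schema.index() is the other common route.

```javascript
// Hypothetical helper: `Product` is assumed to be a Mongoose model.
// Model.collection exposes the underlying driver collection, which has createIndex().
function ensureOrderLotIndex(Product) {
  // Compound index matching the { order, lot } lookup used in the question.
  return Product.collection.createIndex({ order: 1, lot: 1 });
}
```

Calling it once at startup, before the nightly import runs, should be enough, since creating an index that already exists is a no-op.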

Also, your await on console.log is weird. That may be causing your timing issues: console.log is not async. Additionally, the function is not marked async.

        // removing await from console.log
        let product = {"order": data.order, "lot": data.lot}
          // Check whether product exist in database or not
          await db.Product.find(product, function(err, foundProduct){
                if(foundProduct && foundProduct.length !== 0){
                      console.log("Product exists")
                } else{
                      buffer.push(product);
                      console.log("Product not exists")
                }    
          })

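I would try removing the await on console.log (this may be a red herring if the console.log is only there for Stack Overflow and is hiding the actual async method). If that is the case, though, the function still needs to be marked async.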
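One more thing that may be worth checking: in Mongoose versions that accept callbacks, passing a callback and also awaiting the same query can execute it twice, which might explain the duplicate queries seen in the debug output. A minimal sketch of the lookup with await only, assuming Product is a Mongoose model (collectIfNew is a hypothetical helper name, not something from the question):

```javascript
// Hypothetical helper: `Product` is assumed to be a Mongoose model.
async function collectIfNew(Product, row, buffer) {
  const key = { order: row.order, lot: row.lot };
  // No callback is passed, so the query is executed once, by the await.
  const existing = await Product.findOne(key).lean().exec();
  if (existing === null) {
    // Not in the database yet: remember it for the later insertMany().
    buffer.push(key);
  }
  return existing === null;
}
```

The function is marked async, so the await is valid, and findOne avoids building an array when only existence matters.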

Lastly, if the problem still exists, I would look into a two-tiered approach:

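Insert all the rows from the CSV file into a mongo collection.
Process the mongo collection after the CSV has been parsed. This takes the CSV out of the equation.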
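The two steps above could be sketched roughly as follows. This is only an illustration under assumptions: Staging and Product are taken to be Mongoose models, importViaStaging is a hypothetical name, and the second tier uses a bulkWrite upsert as one possible way to add only the rows that are missing.

```javascript
// Hypothetical helper: `Staging` and `Product` are assumed to be Mongoose models.
async function importViaStaging(Staging, Product, rows) {
  // Tier 1: store every parsed CSV row without checking anything.
  await Staging.insertMany(rows, { ordered: false });

  // Tier 2: work from the staging collection only, skipping the "000000" rows.
  const staged = await Staging.find({ cwotdt: { $ne: "000000" } }).lean().exec();
  const ops = staged.map((row) => ({
    updateOne: {
      filter: { order: row.order, lot: row.lot },
      // $setOnInsert only writes when the upsert creates a new document.
      update: { $setOnInsert: { order: row.order, lot: row.lot } },
      upsert: true,
    },
  }));
  if (ops.length > 0) {
    await Product.bulkWrite(ops, { ordered: false });
  }

  // Clear the staging collection so the next run starts empty.
  await Staging.deleteMany({});
  return ops.length;
}
```

With the compound index on order and lot in place, each upsert filter is an index lookup, so the per-row find() from the question is no longer needed.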

Answer 2

The fork-a-child-process approach:

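When the web service receives the request with the CSV data file, save the file somewhere in the app.
Fork a child process -> child process example
Pass the file URL to the child_process to run the insert checks.
When the child process has finished processing the CSV file, delete the file.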
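A rough sketch of those steps using Node's built-in child_process module. The names importInChildProcess and workerPath are hypothetical, and the worker script itself (which would read the CSV path from process.argv[2] and run the insert checks) is assumed and not shown.

```javascript
const { fork } = require("child_process");
const fs = require("fs");

// Hypothetical helper: `workerPath` is assumed to be a script that reads the
// CSV path from process.argv[2] and performs the insert checks.
function importInChildProcess(workerPath, csvPath) {
  return new Promise((resolve, reject) => {
    // The heavy work runs in a separate process, so the server's event loop stays free.
    const child = fork(workerPath, [csvPath]);
    child.on("error", reject);
    child.on("exit", (code) => {
      // Delete the saved CSV once the child is done, whatever the outcome.
      fs.unlink(csvPath, (unlinkErr) => {
        if (code !== 0) {
          reject(new Error(`import worker exited with code ${code}`));
        } else if (unlinkErr) {
          reject(unlinkErr);
        } else {
          resolve();
        }
      });
    });
  });
}
```

The request handler can respond as soon as the file is saved and the child is forked, instead of waiting for the import to finish.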

As Joe said, indexing the database will speed up processing time a lot when there are many (millions of) tuples.

Difficulty processing a CSV file, browser timeout
