Question

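I'm building a large Lucene index, and each document I insert needs a bit of "putting together" before it goes in. I'm reading all of the documents from a database and inserting them into the index. Lucene lets you build several separate indexes and merge them together later, so I came up with this: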

// we'll use a producer/consumer pattern for the job
var documents = new BlockingCollection<Document>();

// we'll have a pool of index writers (each will create its own index)
var indexWriters = new ConcurrentBag<IndexWriter>();

// start filling the collection with documents
Task writerTask = new Task(() => {
    foreach (var document in database)
        documents.Add(document);
    documents.CompleteAdding(); // signal the consumers that no more documents are coming
}, TaskCreationOptions.LongRunning);
writerTask.Start();

// iterate through the collection, obtaining index writers from the pool and
// creating them when necessary.
Parallel.ForEach(documents.GetConsumingEnumerable(token.Token), document =>
{
    IndexWriter writer;
    if(!indexWriters.TryTake(out writer))
    {
        var dirInfo = new DirectoryInfo(string.Concat(_indexPath, "\\~", Guid.NewGuid().ToString("N")));
        dirInfo.Create();
        var dir = FSDirectory.Open(dirInfo);
        // assign to the outer 'writer'; a new local here would leave 'writer' null
        writer = new IndexWriter(dir, getAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }
    // prepare and insert the document into the current index
    WriteDocument(writer, document);
    indexWriters.Add(writer); // put the writer back in the pool
});

// now get all of the writers and merge the indexes together...

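The only thing that gives me pause is that pulling an IndexWriter out of the pool on every iteration (and putting it back at the end) may be less efficient than simply creating the optimal number of threads up front. On the other hand, I know that ConcurrentBag is very efficient and has extremely low overhead.

Is my solution OK, or does it scream out for a better one?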
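
One way to compare the pooled approach against "one writer per worker" is the `Parallel.ForEach` overload that takes `localInit` and `localFinally` delegates. The following is only a sketch under stated assumptions: `FakeWriter` and `IndexAll` are hypothetical names, and `FakeWriter` is an in-memory stand-in for an `IndexWriter`, so that the example runs without Lucene.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for an IndexWriter; it only collects items in memory.
public sealed class FakeWriter
{
    public readonly List<int> Items = new List<int>();
    public void Write(int item) { Items.Add(item); }
}

public static class PerWorkerWriterDemo
{
    // Gives each worker task its own writer for the task's lifetime, so no
    // writer is taken from or returned to a shared pool per document.
    public static List<FakeWriter> IndexAll(IEnumerable<int> documents)
    {
        var finished = new ConcurrentBag<FakeWriter>();

        Parallel.ForEach(
            documents,
            () => new FakeWriter(),            // localInit: runs once per worker task
            (doc, loopState, writer) =>
            {
                writer.Write(doc);             // only this task touches this writer
                return writer;                 // handed to this task's next iteration
            },
            writer => finished.Add(writer));   // localFinally: runs once per worker task

        return finished.ToList();
    }

    public static void Main()
    {
        var writers = IndexAll(Enumerable.Range(0, 1000));
        Console.WriteLine(writers.Count + " writers, " +
                          writers.Sum(w => w.Items.Count) + " documents");
    }
}
```

The number of writers created depends on how many worker tasks the runtime decides to schedule, so it is not necessarily equal to the number of cores. With real index writers, the `finished` collection would hold the indexes to merge afterwards.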

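UPDATE:

After some testing, I think loading from the database is a bit slower than the actual indexing. The final index merge is also slow, because I can only use one thread for it and I'm merging 16 indexes holding about 1.7 million documents. Still, I'm open to thoughts on the original question.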

Answer 1

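One problem I've run into with Parallel.ForEach is that, when CPU utilization is low, it can decide to add threads beyond the usual one per core. That makes sense for tasks that are waiting on a response from a remote server, but for slow, disk-intensive work it can sometimes hurt performance because the disk ends up thrashing.

If your processing is disk-bound rather than CPU-bound, you may want to try passing a ParallelOptions with MaxDegreeOfParallelism set to your Parallel.ForEach call, to make sure it isn't thrashing the disk unnecessarily.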
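
A minimal sketch of that suggestion follows. It makes several assumptions: `ProcessAll` is a hypothetical helper standing in for the indexing loop, the cap of 4 workers is an arbitrary value that would need tuning by measurement, and `Thread.Sleep` merely simulates disk I/O.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedParallelDemo
{
    // Hypothetical helper: runs `work` over `items` with at most `maxWorkers`
    // concurrent invocations and returns the highest concurrency observed.
    public static int ProcessAll(int[] items, int maxWorkers, Action<int> work)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };
        int active = 0, peak = 0;

        Parallel.ForEach(items, options, item =>
        {
            int now = Interlocked.Increment(ref active);
            // Record the peak number of simultaneously running bodies.
            int seen;
            while (now > (seen = Volatile.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen) { }

            work(item);
            Interlocked.Decrement(ref active);
        });

        return peak;
    }

    public static void Main()
    {
        int peak = ProcessAll(Enumerable.Range(0, 200).ToArray(), 4,
                              _ => Thread.Sleep(5)); // Sleep stands in for disk I/O
        Console.WriteLine("Peak concurrency: " + peak);
    }
}
```

In the question's loop, the same `ParallelOptions` instance would be passed as the second argument to `Parallel.ForEach`, between the consuming enumerable and the loop body.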
