Question

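I'm using distributed, a framework that allows parallel computation. My main use case here is NumPy. When I include NumPy code that relies on np.linalg, I get an error about OMP_NUM_THREADS, which is related to the OpenMP library.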

A minimal example:

from distributed import Executor
import numpy as np
e = Executor('144.92.142.192:8786')

def f(x, m=200, n=1000):
    A = np.random.randn(m, n)
    x = np.random.randn(n)
    #  return np.fft.fft(x)  # tested; no errors
    #  return np.random.randn(n)  # tested; no errors
    return A.dot(x).sum()  # tested; throws error below

s = [e.submit(f, x) for x in [1, 2, 3, 4]]
s = e.gather(s)

When I test with linalg, e.gather fails because every job throws the following error:

OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.

What should I set OMP_NUM_THREADS to?

Answer 1

Short answer

export OMP_NUM_THREADS=1

or 

dask-worker --nthreads 1

Explanation

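The OMP_NUM_THREADS environment variable controls the number of threads that many libraries, including the BLAS library that powers numpy.dot, use in computations such as matrix multiplication.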

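The conflict here is that you have two parallel libraries calling each other, BLAS and dask.distributed. Each library is designed to use as many threads as there are logical cores available in the system.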

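For example, if you have eight cores, then dask.distributed might run your function f eight times at once on different threads. The numpy.dot call inside f would use eight threads per call, resulting in 64 threads running at once.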

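This is actually fine in the sense that everything still runs correctly, but you take a performance hit: it will be slower than running just eight threads at a time, which you can achieve by limiting either dask.distributed or BLAS.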
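The arithmetic behind the eight-core example can be sketched as follows. The helper name total_threads is hypothetical and only illustrates the multiplication; it assumes every dask task is inside a BLAS call at the same moment, which is the worst case.

```python
def total_threads(dask_threads: int, blas_threads: int) -> int:
    """Worst-case thread count when every dask task calls into BLAS at once."""
    return dask_threads * blas_threads


cores = 8  # assumed number of logical cores, as in the example above

# Both libraries default to one thread per core
print(total_threads(cores, cores))  # 64

# BLAS limited to one thread, e.g. with OMP_NUM_THREADS=1
print(total_threads(cores, 1))  # 8

# dask limited to one thread, e.g. with dask-worker --nthreads 1
print(total_threads(1, cores))  # 8
```

Either limit brings the total back to the number of cores; which one is preferable likely depends on whether the workload is many small tasks or a few large matrix operations.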

Your system probably has OMP_THREAD_LIMIT set to some reasonable number, like 16, to warn you when this happens.
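If you start workers from a Python script rather than a shell, the short answer could be applied roughly like this. It is a minimal sketch: worker_env and launch_worker are hypothetical helper names, the default thread counts are assumptions, and it presumes the dask-worker command is on your PATH. The key point is that OMP_NUM_THREADS has to be in the worker's environment before the worker process loads NumPy.

```python
import os
import subprocess


def worker_env(blas_threads=1):
    """Copy the current environment and cap the OpenMP/BLAS thread count."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(blas_threads)
    return env


def launch_worker(scheduler_address, dask_threads=8, blas_threads=1):
    """Start a dask worker whose BLAS calls use at most blas_threads threads.

    Hypothetical helper: assumes `dask-worker` is installed and on PATH.
    """
    cmd = ["dask-worker", scheduler_address, "--nthreads", str(dask_threads)]
    return subprocess.Popen(cmd, env=worker_env(blas_threads))


# Inspect the environment a worker would receive (no worker is started here)
print(worker_env(1)["OMP_NUM_THREADS"])
```

Calling launch_worker('144.92.142.192:8786') would then start one worker with eight dask threads and single-threaded BLAS, using the scheduler address from the question.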

Answer 2

If you are using MKL BLAS, you may also see an improvement with the TBB threading layer. I haven't actually had a chance to try it yet, so YMMV.

http://conference.scipy.org/proceedings/scipy2018/anton_malakhov.html
