sklearn LogisticRegression: does it use multiple background threads?

Asked: 2018-12-12 18:34:02

Tags: python scikit-learn python-multiprocessing python-multithreading

I have code that uses sklearn.linear_model.LogisticRegression and sklearn.ensemble.RandomForestClassifier. With everything else in the code held constant, running it with a multiprocessing pool spawns hundreds of threads on the logistic regression path, which completely wrecks performance. htop screenshots on a 36-processor machine:

Idle:

(htop screenshot)

Forest (one processor stays idle, as expected):

(htop screenshot)

Logistic (all processors at 100% utilization):

(htop screenshot)

Does logistic regression spawn background threads (it does), and if so, is there a way to prevent it?

$ python3.6
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
>>> import sklearn
>>> sklearn.__version__
'0.20.1'

2 Answers:

Answer 0 (score: 1)

When instantiating sklearn.linear_model.LogisticRegression, you can always pass the number of workers to use via n_jobs=N, where N is the desired count. I would check whether running it with n_jobs=1 helps. Otherwise, Python may be misreading the number of cores available in your environment. To be sure, I would check:

import multiprocessing
print(multiprocessing.cpu_count())

Under the hood, LogisticRegression uses sklearn.externals.joblib.Parallel for its parallelism. Its logic is fairly involved, so without detailed knowledge of your environment setup it is hard to say exactly what it will do.
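As a minimal sketch of the n_jobs=1 suggestion above (the toy dataset is illustrative; note that n_jobs only caps joblib's workers, it does not necessarily cap threads opened by native libraries underneath):

```python
from sklearn.linear_model import LogisticRegression

# n_jobs=1 restricts the joblib-driven one-vs-rest loop to a single worker.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

model = LogisticRegression(n_jobs=1)
model.fit(X, y)
print(model.predict([[0.5], [2.5]]))
```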

Answer 1 (score: 1)

Assuming this happens when you fit the model, look at this part of the source code of the model's fit() method (link):

    # The SAG solver releases the GIL so it's more efficient to use
    # threads for this solver.
    if solver in ['sag', 'saga']:
        prefer = 'threads'
    else:
        prefer = 'processes'
    fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
                           **_joblib_parallel_args(prefer=prefer))(
        path_func(X, y, pos_class=class_, Cs=[self.C],
                  fit_intercept=self.fit_intercept, tol=self.tol,
                  verbose=self.verbose, solver=solver,
                  multi_class=multi_class, max_iter=self.max_iter,
                  class_weight=self.class_weight, check_input=False,
                  random_state=self.random_state, coef=warm_start_coef_,
                  penalty=self.penalty,
                  max_squared_sum=max_squared_sum,
                  sample_weight=sample_weight)
        for class_, warm_start_coef_ in zip(classes_, warm_start_coef))

Note the case where:

prefer = 'threads'
**_joblib_parallel_args(prefer=prefer)

If you are running the sag or saga solver, you may be hitting this threading path. But the default solver is liblinear.
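So one hedged workaround, assuming the extra threads come from the prefer='threads' branch shown above, is to pin the solver explicitly so fit() takes the prefer='processes' branch (toy data is illustrative):

```python
from sklearn.linear_model import LogisticRegression

# Explicitly choosing liblinear keeps fit() on the prefer='processes'
# branch of the snippet above; sag/saga would take prefer='threads'.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

model = LogisticRegression(solver='liblinear')
model.fit(X, y)
print(model.coef_.shape)
```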

Also, judging from the source of the Parallel() used above (link), sklearn suggests this as a possible way around the threading issue:

'threading' is a low-overhead alternative that is most efficient for
functions that release the Global Interpreter Lock: e.g. I/O-bound code or
CPU-bound code in a few calls to native code that explicitly releases the
GIL.
In addition, if the `dask` and `distributed` Python packages are installed,
it is possible to use the 'dask' backend for better scheduling of nested
parallel calls without over-subscription and potentially distribute
parallel calls over a networked cluster of several hosts.

As I understand it, something like the following could reduce the threading:

from dask.distributed import Client
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression


...
# create local cluster
client = Client(processes=False)             
model = LogisticRegression()
with joblib.parallel_backend('dask'):
    model.fit(...)
...

using Dask Joblib as suggested.
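If installing dask is not an option, the same joblib context manager can pin a plain backend instead. This is a sketch under the assumption that the extra threads come from joblib (standalone joblib's parallel_backend; sklearn 0.20 vendors the same API as sklearn.externals.joblib):

```python
import joblib  # standalone joblib; sklearn 0.20 vendors it as sklearn.externals.joblib
from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

model = LogisticRegression()
# Force joblib to a single worker for everything inside this block.
with joblib.parallel_backend('threading', n_jobs=1):
    model.fit(X, y)
print(model.predict([[0.5]]))
```

This will not help if the thread explosion actually comes from a native BLAS library rather than joblib, so it is worth testing in your environment.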