Why does sklearn.linear_model.LogisticRegression perform so badly with n_jobs = 2?

Asked: 2018-04-26 13:59:36

Tags: python machine-learning scikit-learn multiprocessing google-compute-engine

I am building a logistic regression model for the MNIST database with the scikit-learn package. I noticed that it performed very poorly with the default parameters, and after finding this tutorial I changed the sklearn.linear_model.LogisticRegression solver to 'lbfgs'. Happily, that worked well: the model trained on all 60000 elements of the training set in under 2 minutes.
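
For clarity, the change amounted to the following (a minimal sketch; in the scikit-learn version I was using, the default solver is 'liblinear'):

from sklearn.linear_model import LogisticRegression

# before: default parameters (solver='liblinear' in my scikit-learn version)
slow_clf = LogisticRegression()

# after: the solver suggested by the tutorial
fast_clf = LogisticRegression(solver='lbfgs')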

I am running on Google Compute Engine, so I wanted to use multiple cores to train the model faster. I set up an instance with 2 cores and passed n_jobs = 2 to LogisticRegression. However, the algorithm performed worse than with n_jobs = 1.
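
As a quick sanity check (added here for completeness, not part of the timing runs), the VM should expose both cores to Python:

import multiprocessing

# on a 2-core Compute Engine instance this should print 2
print(multiprocessing.cpu_count())

Here are the snippets: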

Importing the data and converting it to np.ndarray objects:

import numpy as np
import matplotlib.pyplot as plt
from mnist import MNIST

# python-mnist reader pointed at the raw MNIST files
mndata = MNIST('./data')

images_train, labels_train = mndata.load_training()
images_test, labels_test = mndata.load_testing()

# the labels come back as array.array; convert to plain lists first
labels_train = labels_train.tolist()
labels_test = labels_test.tolist()

# the *_all arrays are sliced per run in log_test() below
X_train_all = np.array(images_train)
y_train_all = np.array(labels_train)
X_test_all = np.array(images_test)
y_test_all = np.array(labels_test)
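
For reference, a quick shape check on the loaded arrays; MNIST should give 60000 training rows and 10000 test rows of 784 (28x28) pixels each:

print(X_train_all.shape, y_train_all.shape)  # expected: (60000, 784) (60000,)
print(X_test_all.shape, y_test_all.shape)    # expected: (10000, 784) (10000,)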

The main function:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
import time

def log_test(train_size, c, cores):
    # take the first train_size examples from the full training set
    X_train = X_train_all[:train_size]
    y_train = y_train_all[:train_size]

    start_time = time.time()

    logreg = LogisticRegression(C=c, solver='lbfgs', n_jobs=cores).fit(X_train, y_train)
    print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
    print("Test set score: {:.3f}".format(logreg.score(X_test_all, y_test_all)))

    # note: the elapsed time includes the two score() calls as well as the fit
    elapsed_time = time.time() - start_time
    print(elapsed_time)

Results with n_jobs = 1 versus n_jobs = 2:

  • log_test(2000, 100, 1) - 2s
  • log_test(2000, 100, 2) - 9s
  • log_test(5000, 100, 1) - 8s
  • log_test(5000, 100, 2) - 27s
  • log_test(7000, 100, 1) - 13s
  • log_test(7000, 100, 2) - 55s
  • log_test(15000, 100, 1) - 27s
  • log_test(15000, 100, 2) - 115s
  

The question: how can I use multiple cores to speed up the algorithm?
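
For what it's worth, the scikit-learn docs say n_jobs is used to parallelize over classes only when multi_class='ovr', so the one variant I can think of is making that explicit (a hypothetical sketch; I have not verified that it changes anything):

from sklearn.linear_model import LogisticRegression

# hypothetical variant: spell out the one-vs-rest strategy, since n_jobs
# parallelizes over the per-class binary problems only when multi_class='ovr'
logreg = LogisticRegression(C=100, solver='lbfgs', multi_class='ovr', n_jobs=2)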

0 Answers:

There are no answers yet.