Keras not using multiple cores

Date: 2016-04-28 08:15:31

Tags: python-3.4 theano blas keras openblas

Based on the well-known check_blas.py script, I wrote my own script to check that Theano can in fact use multiple cores:

import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'

import numpy
import theano
import theano.tensor as T

M=2000
N=2000
K=2000
iters=100
order='C'

a = theano.shared(numpy.ones((M, N), dtype=theano.config.floatX, order=order))
b = theano.shared(numpy.ones((N, K), dtype=theano.config.floatX, order=order))
c = theano.shared(numpy.ones((M, K), dtype=theano.config.floatX, order=order))
f = theano.function([], updates=[(c, 0.4 * c + .8 * T.dot(a, b))])

for i in range(iters):
    f()

Running this with python3 check_theano.py shows that 8 threads are being used. More importantly, the code runs about 9 times faster than without the os.environ settings, which leave it on a single core: 7.863 s versus 71.292 s for a single run.
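For reference, one simple way to take such a timing yourself (my addition, not part of the original script) is to replace the loop at the end with a timed version and run the script once with and once without the os.environ lines:

import time

start = time.perf_counter()
for i in range(iters):
    f()
elapsed = time.perf_counter() - start
# with 8 BLAS/OpenMP threads this should come out several times faster than with 1 thread
print('{} gemm updates of a {}x{} result took {:.3f} s'.format(iters, M, K, elapsed))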

So, I would expect that Keras now also uses multiple cores when calling fit (or predict, for that matter). However, this is not the case for the following code:

import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'

import numpy
from keras.models import Sequential
from keras.layers import Dense

coeffs = numpy.random.randn(100)

x = numpy.random.randn(100000, 100)
y = numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01

model = Sequential()
model.add(Dense(20, input_shape=(100,)))
model.add(Dense(1, input_shape=(20,)))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x, y, verbose=0, nb_epoch=10)

This script uses only 1 core, with this output:

Using Theano backend.
/home/herbert/venv3/lib/python3.4/site-packages/theano/tensor/signal/downsample.py:5: UserWarning: downsample module has been moved to the pool module.
warnings.warn("downsample module has been moved to the pool module.")

Why does the fit of Keras only use 1 core for the same setup? And is the check_blas.py script actually representative of neural network training computations?

FYI:

(venv3)herbert@machine:~/ $ python3 -c 'import numpy, theano, keras; print(numpy.__version__); print(theano.__version__); print(keras.__version__);'
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
1.11.0
0.8.0rc1.dev-e6e88ce21df4fbb21c76e68da342e276548d4afd
0.3.2
(venv3)herbert@machine:~/ $

EDIT

I also wrote a plain Theano implementation of a simple MLP, and it does not run multi-core either:

import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'

import numpy
import theano
import theano.tensor as T

M=2000
N=2000
K=2000
iters=100
order='C'

coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100).astype(theano.config.floatX)
y = (numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01).astype(theano.config.floatX).reshape(100000, 1)

x_shared = theano.shared(x)
y_shared = theano.shared(y)

x_tensor = T.matrix('x')
y_tensor = T.matrix('y')

W0_values = numpy.asarray(
    numpy.random.uniform(
        low=-numpy.sqrt(6. / 120),
        high=numpy.sqrt(6. / 120),
        size=(100, 20)
    ),
    dtype=theano.config.floatX
)
W0 = theano.shared(value=W0_values, name='W0', borrow=True)

b0_values = numpy.zeros((20,), dtype=theano.config.floatX)
b0 = theano.shared(value=b0_values, name='b0', borrow=True)

output0 = T.dot(x_tensor, W0) + b0

W1_values = numpy.asarray(
    numpy.random.uniform(
        low=-numpy.sqrt(6. / 120),
        high=numpy.sqrt(6. / 120),
        size=(20, 1)
    ),
    dtype=theano.config.floatX
)
W1 = theano.shared(value=W1_values, name='W1', borrow=True)

b1_values = numpy.zeros((1,), dtype=theano.config.floatX)
b1 = theano.shared(value=b1_values, name='b1', borrow=True)

output1 = T.dot(output0, W1) + b1

params = [W0, b0, W1, b1]
cost = ((output1 - y_tensor) ** 2).sum()

gradients = [T.grad(cost, param) for param in params]

learning_rate = 0.0000001

updates = [
    (param, param - learning_rate * gradient)
    for param, gradient in zip(params, gradients)
]

train_model = theano.function(
    inputs=[],#x_tensor, y_tensor],
    outputs=cost,
    updates=updates,
    givens={
        x_tensor: x_shared,
        y_tensor: y_shared
    }
)

errors = []
for i in range(1000):
    errors.append(train_model())

print(errors[0:50:])


1 answer:

Answer 0 (score: 0):

Keras and TF by themselves do not use all the cores and the full capacity of your CPU! If you want to use 100% of your CPU, multiprocessing.Pool is the tool: it basically creates a pool of jobs that need doing, worker processes pick up these jobs and run them, and when a job is finished the process picks up the next one from the pool (see the code below).

NB:如果您只是想加快此模型的速度,请查看GPU或更改超参数,例如批大小和神经元数量(层大小)。

Here is how you can use multiprocessing to train multiple models at the same time, using processes that run in parallel on the separate CPU cores of your machine.

This answer was inspired by @repploved.

import time
import signal
import multiprocessing

def init_worker():
    ''' Add KeyboardInterrupt exception to multiprocessing workers '''
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def train_model(layer_size):
    '''
    This code is parallelized and runs on each process
    It trains a model with different layer sizes (hyperparameters)
    It saves the model and returns the score (error)
    '''
    import keras
    from keras.models import Sequential
    from keras.layers import Dense

    print(f'Training a model with layer size {layer_size}')

    # build your model here
    model_RNN = Sequential()
    model_RNN.add(Dense(layer_size))

    # fit the model (the bit that takes time!)
    model_RNN.fit(...)

    # lets demonstrate with a sleep timer
    time.sleep(5)

    # save trained model to a file
    model_RNN.save(...)

    # you can also return values eg. the eval score
    return model_RNN.evaluate(...)


num_workers = 4
hyperparams = [800, 960, 1100]

pool = multiprocessing.Pool(num_workers, init_worker)

scores = pool.map(train_model, hyperparams)

print(scores)

Output:

Training a model with layer size 800
Training a model with layer size 960
Training a model with layer size 1100
[{'size':960,'score':1.0}, {'size':800,'score':1.2}, {'size':1100,'score':0.7}]

With the time.sleep in the code this is easy to demonstrate: you will see all three processes start their training job and then finish at almost the same time. If this were single-processed, you would have to wait for each one to finish before the next could start (yawn!).
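For contrast, the purely sequential version of the same hyperparameter sweep (a sketch that reuses train_model and hyperparams from the script above) runs the jobs one after another, so their durations simply add up:

# sequential baseline: each call blocks until that training job is done,
# so three jobs take roughly three times as long as the pooled run
scores_sequential = [train_model(size) for size in hyperparams]
print(scores_sequential)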