Blocking of tf.contrib.StagingArea get() and put() operations

Time: 2017-09-11 17:24:23

Tags: tensorflow tensorflow-gpu

Environment

  • TensorFlow release version: 1.3.0-rc2
  • TensorFlow git version: v1.3.0-rc1-994-gb93fd37
  • Operating system: CentOS Linux release 7.2.1511 (Core)

Problem scenario

I am using TensorFlow StagingArea ops to make my input pipeline more efficient. Here is part of the code that builds the input pipeline:

    train_put_op_list = []
    train_get_op_list = []
    val_put_op_list = []
    val_get_op_list = []
    with tf.variable_scope(tf.get_variable_scope()) as vscope:
        for i in range(4):
            with tf.device('/gpu:%d'%i):
                with tf.name_scope('GPU-Tower-%d'%i) as scope:
                    trainstagingarea = tf.contrib.staging.StagingArea(
                        dtypes=[tf.float32, tf.int32],
                        shapes=[[64, 221, 221, 3], [64]],
                        capacity=0)
                    valstagingarea = tf.contrib.staging.StagingArea(
                        dtypes=[tf.float32, tf.int32],
                        shapes=[[128, 221, 221, 3], [128]],
                        capacity=0)
                    train_put_op_list.append(trainstagingarea.put(train_iterator.get_next()))
                    val_put_op_list.append(valstagingarea.put(val_iterator.get_next()))
                    train_get_op_list.append(trainstagingarea.get())
                    val_get_op_list.append(valstagingarea.get())
                    with tf.device('/cpu:0'):
                        worktype = tf.get_variable("wt",[], initializer=tf.zeros_initializer(), trainable=False)
                    workcondition = tf.equal(worktype, 1)
                    #elem = tf.cond(workcondition, lambda: train_iterator.get_next(), lambda: val_iterator.get_next())
                    elem = tf.cond(workcondition, lambda: train_get_op_list[i], lambda: val_get_op_list[i])
                    # This is followed by the network construction and optimizer 

At execution time, I first run the put() ops a few times and then start iterating, as shown below:

    with tf.Session(config=config) as sess:
        sess.run(init_op)
        sess.run(iterator_training_op)
        sess.run(iterator_validation_op)
        sess.run(tf.assign(worktype, 0))
        for i in range(4):
            sess.run(train_put_op_list)
            sess.run(val_put_op_list)
        writer = tf.summary.FileWriter('.', graph=tf.get_default_graph())
        epoch = 0
        iter = 0
        previous = 0
        while(epoch<10):
            try:
                if(PROCESSINGTYPE == 'validation'):
                    sess.run(val_put_op_list)
                    [val_accu, summaries, numsamp] = sess.run([running_accuracy, validation_summary_op, processed])
                    previous+=numsamp
                    print("Running Accuracy = {} : Number of sample processed = {} ".format(val_accu, previous))
                else:
                    sess.run(train_put_op_list)
                    [loss_value, _, train_accu, summaries, batch_accu, numsamp] = sess.run(
                        [total_loss, apply_gradient_op, running_accuracy,
                         training_summary_op, batch_accuracy, processed])
                    #Remaining part of the code (not important for question)

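Problem description

Using StagingArea speeds things up considerably (roughly 3-4x). However, the code hangs because something blocks. I am not sure whether the blocking comes from the get() or the put() operation. This is the actual output: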

# Validation is done first and the following is the output
Running Accuracy = 0.0 : Number of sample processed = 512
Running Accuracy = 0.00390625 : Number of sample processed = 1024
Running Accuracy = 0.0 : Number of sample processed = 1536
Running Accuracy = 0.001953125 : Number of sample processed = 2048
# The code hangs here
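
One way to narrow down whether a get() or a put() is the call that blocks is to give each call a deadline and see which one times out. The sketch below illustrates the idea with the standard-library `queue.Queue` as a stand-in for a StagingArea; it is not TensorFlow code, and `try_get` and `try_put` are hypothetical helper names. In TensorFlow 1.x a similar effect can be obtained by passing `options=tf.RunOptions(timeout_in_ms=...)` to `sess.run`, which raises `tf.errors.DeadlineExceededError` when the deadline passes.

```python
import queue


def try_get(buf, timeout_s=0.1):
    """Return (True, item) if an item arrived in time, else (False, None)."""
    try:
        return True, buf.get(timeout=timeout_s)
    except queue.Empty:
        return False, None


def try_put(buf, item, timeout_s=0.1):
    """Return True if the item was stored in time, False if the buffer stayed full."""
    try:
        buf.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False


buf = queue.Queue(maxsize=2)    # hypothetical stand-in for one StagingArea
get_ok, _ = try_get(buf)        # buffer is empty, so the get times out
put_results = [try_put(buf, n) for n in range(3)]  # third put exceeds capacity

print("get succeeded:", get_ok)
print("put results:", put_results)
```

A timed-out get points at an empty buffer, and a timed-out put points at a full one.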

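As you can see, at the start of the `with tf.Session() as sess:` block the get() and put() ops were run 4 times, and the output is also limited to 4 lines. That suggests sess.run(val_put_op_list) inside the while loop does nothing. So when get() is triggered by sess.run(running_accuracy, ...), the StagingArea is found empty after 4 lines, and the call blocks.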
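
One possible explanation, which the post itself does not confirm: `tf.cond` only makes conditional the ops that are created inside its two lambdas. Here both `train_get_op_list[i]` and `val_get_op_list[i]` are created before `tf.cond` is called, so each step may dequeue from both staging areas regardless of `workcondition`. The pure-Python model below (plain deques, not TensorFlow; the function name and its parameters are made up for illustration) counts how many steps complete under that assumption when only the validation buffer is refilled.

```python
from collections import deque


def steps_before_hang(prefill, max_steps, phase, cond_runs_both_gets):
    """Count loop steps completed before a get() finds an empty buffer.

    Hypothetical model: a "put" appends one batch, a "get" pops one, and a
    get on an empty deque stands for a get() that would block.
    """
    buffers = {"train": deque(), "val": deque()}
    for _ in range(prefill):              # warm-up puts before the while loop
        buffers["train"].append("batch")
        buffers["val"].append("batch")

    other = "train" if phase == "val" else "val"
    for step in range(max_steps):
        buffers[phase].append("batch")    # only the active phase is refilled
        consumed = [phase, other] if cond_runs_both_gets else [phase]
        for name in consumed:
            if not buffers[name]:
                return step               # this get() would block
            buffers[name].popleft()
    return max_steps


both = steps_before_hang(prefill=4, max_steps=100, phase="val",
                         cond_runs_both_gets=True)
only_selected = steps_before_hang(prefill=4, max_steps=100, phase="val",
                                  cond_runs_both_gets=False)
print(both, only_selected)
```

Under this model the run stalls after 4 steps when both gets execute, which matches the 4 output lines, because the training buffer is drained and never refilled. If that is what happens in the real graph, calling `get()` inside each lambda passed to `tf.cond` would be one thing to try.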

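  • Is my analysis of the problem correct?
  • What is the right way to use the get() and put() ops here?
  • If the StagingArea is full and put() blocks, does that block the whole program as well? The TensorFlow documentation says nothing about this.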
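
On the last question, a general point about blocking calls may help: a blocked call stalls the thread that issued it, so whether the whole program stops depends on whether put and get are issued from the same thread. The sketch below shows this with the standard-library `queue.Queue` and `threading` as stand-ins; it is not TensorFlow code, and the names `producer` and `buf` are hypothetical. Note also that the code above passes `capacity=0`, which, as far as I know, the StagingArea documentation describes as unbounded, so a put() blocking on a full buffer should not apply to that configuration.

```python
import queue
import threading

buf = queue.Queue(maxsize=2)   # hypothetical bounded stand-in for a StagingArea
produced = []


def producer(num_batches):
    for n in range(num_batches):
        buf.put(n)             # blocks only this thread while the buffer is full
        produced.append(n)


worker = threading.Thread(target=producer, args=(5,), daemon=True)
worker.start()

# The main thread keeps consuming, which frees slots and unblocks the producer.
consumed = [buf.get() for _ in range(5)]
worker.join(timeout=5)
print(consumed)
```

If the puts and gets were instead issued one after another from a single thread, a put on a full bounded buffer would stall that loop, since nothing would be left to drain it.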

1 Answer:

Answer 0: (score: 1)

Take a look at https://github.com/tensorflow/tensorflow/pull/13684. It fixes some deadlocks and will probably make it into 1.4.0. Disclaimer: I am not a TensorFlower.