GPU does not release resources at each epoch in TensorFlow

Time: 2017-11-30 03:22:25

Tags: python tensorflow out-of-memory gpu

I am trying to run this code: https://github.com/leehomyc/cyclegan-1. Everything works fine until my GPU runs out of memory. My intuition tells me that GPU resources should be released at each epoch of the algorithm, so why does that not seem to happen here? It looks as if GPU memory keeps accumulating until nothing more can be allocated, and then the error below is thrown. I have tried limiting GPU usage, but that does not seem to work either. The image set I am working with is about 100 MB, and my graphics card has 4 GB of memory. Can anyone point out my mistake? If you need more information, let me know and I will provide it. Finally, could using tf.train.Coordinator, tf.train.start_queue_runners(coord=coord) or tf.summary.FileWriter(self._output_dir) cause this error? Any information would be appreciated. Thanks.
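
For context, the GPU limiting I tried looks roughly like the snippet below (a minimal sketch; the 0.5 fraction is only an example value, and the same options appear in the training code further down):

import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand instead of grabbing it all upfront
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Alternatively, cap the fraction of GPU memory the process may use (example value)
# config.gpu_options.per_process_gpu_memory_fraction = 0.5

with tf.Session(config=config) as sess:
    pass  # build and run the model inside this session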

ERROR

Model/g_B/c6/Conv/weights:0
Model/g_B/c6/Conv/biases:0
2017-11-29 11:20:40.825993: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 
2017-11-29 11:20:40.826013: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-29 11:20:40.826017: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-29 11:20:40.826020: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-29 11:20:40.826040: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-29 11:20:40.955576: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-29 11:20:40.956141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1050 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.62
pciBusID 0000:01:00.0
Total memory: 3.94GiB
Free memory: 3.56GiB
2017-11-29 11:20:40.956177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-11-29 11:20:40.956188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-11-29 11:20:40.956207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0)
hereheeeheheheeheh
2017-11-29 11:20:40.958641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0)
('In the epoch ', 0)
Saving image 0/20
Saving image 1/20
Saving image 2/20
Saving image 3/20
Saving image 4/20
Saving image 5/20
Saving image 6/20
Saving image 7/20
Saving image 8/20
Saving image 9/20
Saving image 10/20
Saving image 11/20
Saving image 12/20
Saving image 13/20
Saving image 14/20
Saving image 15/20
Saving image 16/20
Saving image 17/20
Saving image 18/20
Saving image 19/20
Processing batch 0/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 1/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 2/200
Garbage collector: collected 0 objects.

**ETC**

Processing batch 194/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 195/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 196/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 197/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 198/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 199/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
('In the epoch ', 1)
Saving image 0/20
Saving image 1/20
Saving image 2/20
Saving image 3/20
Saving image 4/20
Saving image 5/20
Saving image 6/20
Saving image 7/20
Saving image 8/20
Saving image 9/20
Saving image 10/20
Saving image 11/20
Saving image 12/20
Saving image 13/20
Saving image 14/20
Saving image 15/20
Saving image 16/20
Saving image 17/20
Saving image 18/20
Saving image 19/20
Processing batch 0/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 1/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 2/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 3/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 4/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 5/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 6/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 7/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 8/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 9/200
Garbage collector: collected 0 objects.
('lets see:   ', None)
Processing batch 10/200
2017-11-29 11:25:06.741162: E tensorflow/stream_executor/cuda/cuda_driver.cc:955] failed to alloc 4294967296 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-11-29 11:25:06.742407: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 4294967296
2017-11-29 11:25:06.742620: E tensorflow/stream_executor/cuda/cuda_driver.cc:955] failed to alloc 3865470464 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-11-29 11:25:06.742630: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 3865470464
2017-11-29 11:25:06.742826: E tensorflow/stream_executor/cuda/cuda_driver.cc:955] failed to alloc 3478923264 bytes on host: CUDA_ERROR_OUT_OF_MEMORY
2017-11-29 11:25:06.742835: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 3478923264
Killed

Training method code

def train(self):
    """Training Function."""
    # Load Dataset from the dataset folder
    self.inputs = data_loader.load_data(
        self._dataset_name, self._size_before_crop,
        True, self._do_flipping)

    # Build the network
    self.model_setup()

    # Loss function calculations
    self.compute_losses()

    # Initializing the global variables
    init = (tf.global_variables_initializer(),
            tf.local_variables_initializer())
    saver = tf.train.Saver()

    max_images = cyclegan_datasets.DATASET_TO_SIZES[self._dataset_name]
    #gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)

    print("hereheeeheheheeheh")
    config=tf.ConfigProto()
    config.gpu_options.allow_growth=True




    with tf.Session(config=config) as sess:
        sess.run(init)

        # Restore the model to run the model from last checkpoint
        if self._to_restore:
            chkpt_fname = tf.train.latest_checkpoint(self._checkpoint_dir)
            saver.restore(sess, chkpt_fname)

        writer = tf.summary.FileWriter(self._output_dir)

        if not os.path.exists(self._output_dir):
            os.makedirs(self._output_dir)

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)

        # Training Loop
        for epoch in range(sess.run(self.global_step), self._max_step):
            print("In the epoch ", epoch)
            saver.save(sess, os.path.join(
                self._output_dir, "cyclegan"), global_step=epoch)

            # Dealing with the learning rate as per the epoch number
            if epoch < 100:
                curr_lr = self._base_lr
            else:
                curr_lr = self._base_lr - \
                    self._base_lr * (epoch - 100) / 100

            self.save_images(sess, epoch)

            for i in range(0, max_images):
                print("Processing batch {}/{}".format(i, max_images))

                inputs = sess.run(self.inputs)

                # Optimizing the G_A network
                _, fake_B_temp, summary_str = sess.run(
                    [self.g_A_trainer,
                     self.fake_images_b,
                     self.g_A_loss_summ],
                    feed_dict={
                        self.input_a:
                            inputs['images_i'],
                        self.input_b:
                            inputs['images_j'],
                        self.learning_rate: curr_lr
                    }
                )
                writer.add_summary(summary_str, epoch * max_images + i)

                fake_B_temp1 = self.fake_image_pool(
                    self.num_fake_inputs, fake_B_temp, self.fake_images_B)

                # Optimizing the D_B network
                _, summary_str = sess.run(
                    [self.d_B_trainer, self.d_B_loss_summ],
                    feed_dict={
                        self.input_a:
                            inputs['images_i'],
                        self.input_b:
                            inputs['images_j'],
                        self.learning_rate: curr_lr,
                        self.fake_pool_B: fake_B_temp1
                    }
                )
                writer.add_summary(summary_str, epoch * max_images + i)

                # Optimizing the G_B network
                _, fake_A_temp, summary_str = sess.run(
                    [self.g_B_trainer,
                     self.fake_images_a,
                     self.g_B_loss_summ],
                    feed_dict={
                        self.input_a:
                            inputs['images_i'],
                        self.input_b:
                            inputs['images_j'],
                        self.learning_rate: curr_lr
                    }
                )
                writer.add_summary(summary_str, epoch * max_images + i)

                fake_A_temp1 = self.fake_image_pool(
                    self.num_fake_inputs, fake_A_temp, self.fake_images_A)

                # Optimizing the D_A network
                _, summary_str = sess.run(
                    [self.d_A_trainer, self.d_A_loss_summ],
                    feed_dict={
                        self.input_a:
                            inputs['images_i'],
                        self.input_b:
                            inputs['images_j'],
                        self.learning_rate: curr_lr,
                        self.fake_pool_A: fake_A_temp1
                    }
                )
                writer.add_summary(summary_str, epoch * max_images + i)

                writer.flush()
                collected = gc.collect()
                print("Garbage collector: collected %d objects." % (collected))
                print("lets see:   ",writer.flush())
                self.num_fake_inputs += 1

            sess.run(tf.assign(self.global_step, epoch + 1))

        coord.request_stop()
        coord.join(threads)
        writer.add_graph(sess.graph)
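
As a side note, one way I could check whether new ops are being added to the graph inside the training loop (which would make memory grow over time) is to finalize the graph once it is built; this is only a sketch and not part of the code above:

import tensorflow as tf

# Sketch: after the graph is fully built, freeze it so that any attempt to
# create a new op inside the training loop raises a RuntimeError instead of
# silently growing the graph (and memory) on every iteration.
with tf.Session(config=config) as sess:
    sess.run(init)
    sess.graph.finalize()  # no new ops may be added after this point
    # ... run the training loop exactly as in train() above ...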

0 Answers:

No answers