Question

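I'm using the TensorFlow profiler to analyze my model and see how much time each operation takes. I found some strange behavior: for example, a Conv2D operation that is placed on the GPU (I set log_device_placement=True to check the placement) also shows a large CPU execution time. Here is the code I use for profiling (TensorFlow 1.4.0):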

import tensorflow as tf
from tensorflow.python.profiler import option_builder

builder = option_builder.ProfileOptionBuilder
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
# run and collect metadata
my_session.run(fetch_something, feed_dict=feed_dict, 
  options=run_options, run_metadata=run_metadata)
profiler_opts = builder(builder.time_and_memory()).order_by('micros').build()
# this will output the following results
tf.profiler.profile(my_graph, run_meta=run_metadata, cmd='scope', options=profiler_opts)

Here is the profiler output:

node name | requested bytes | total execution time | accelerator execution time | cpu execution time
MyScope/Conv2D (4511.35MB/4511.35MB, 823.47ms/823.47ms, 445.37ms/445.37ms, 378.11ms/378.11ms)

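According to the profiling result, the Conv2D operation (tf.nn.conv2d) takes 378.11 ms on the CPU and 445.37 ms on the GPU. Why doesn't TensorFlow use only the GPU for Conv2D? Since this operation uses a lot of memory (4511.35MB), is the CPU time spent transferring data between host memory and the GPU?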
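As a quick sanity check on how the columns relate, the output line above can be parsed and the three timings compared. This is only a sketch: `parse_scope_line` is a hypothetical helper, not part of the profiler API, and it assumes the units are MB and ms exactly as printed in that line. On this line the accelerator and CPU times add up to the total within rounding, which suggests the total column is the sum of the other two.

```python
import re


def parse_scope_line(line):
    """Parse one text line of the profiler output into a dict.

    Hypothetical helper; assumes the four columns shown in the post,
    printed in MB and ms.
    """
    name, metrics = line.rstrip(")").split(" (", 1)
    values = []
    for field in metrics.split(", "):
        # Each field has the form "A/B"; keep the number at the start of A.
        first = field.split("/", 1)[0]
        values.append(float(re.match(r"[0-9.]+", first).group(0)))
    requested_mb, total_ms, accelerator_ms, cpu_ms = values
    return {
        "name": name,
        "requested_mb": requested_mb,
        "total_ms": total_ms,
        "accelerator_ms": accelerator_ms,
        "cpu_ms": cpu_ms,
    }


line = ("MyScope/Conv2D (4511.35MB/4511.35MB, 823.47ms/823.47ms, "
        "445.37ms/445.37ms, 378.11ms/378.11ms)")
row = parse_scope_line(line)

# Difference between the reported total and accelerator + cpu.
gap = row["total_ms"] - (row["accelerator_ms"] + row["cpu_ms"])
print("total - (accelerator + cpu) = %.2f ms" % gap)
```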

======== Update ========

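I just noticed another phenomenon. When the "requested bytes" of Conv2D is large (in my case > 4GB), the CPU execution time is long (about 400 to 500 ms). When the "requested bytes" is smaller (1.5GB in my case), the CPU execution time is short (about 15 ms). My guess is that the CPU execution time of Conv2D is related to memory consumption. However, I don't know why Conv2D uses a different amount of "requested bytes" in different batches (my_session.run calls). The tensors that Conv2D is applied to have almost the same size across batches.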
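To put the "requested bytes" figures in perspective, it can help to compute how large a single dense tensor is. The sketch below is plain arithmetic; the shape is a made-up example (the post does not give the real shapes), and 4 bytes per element assumes float32.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element assumes float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Hypothetical NHWC activation shape, not taken from the post.
example_shape = (32, 224, 224, 64)
print(tensor_bytes(example_shape))  # size of one such tensor, in bytes
```

Comparing a number like this with the value the profiler reports shows whether "requested bytes" is of the same order as the input and output tensors or considerably larger.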

Answer 1

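Although I can't see your whole graph, I assume that you are continuously feeding data through feed_dict.
So each time the tensors are evaluated, they take the value of the next element in the underlying dataset, and this also takes CPU time. If you have enough room to hold the data as tf.Tensor objects, you can feed the data directly from GPU memory; see the documentation: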

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().

The example from the corresponding section of the TensorFlow documentation:

import numpy as np
import tensorflow as tf

# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

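Note that the above code snippet will embed the features and labels arrays in your TensorFlow graph as tf.constant() operations. This works well for a small dataset, but it wastes memory (because the contents of the arrays will be copied multiple times) and can run into the 2GB limit for the tf.GraphDef protocol buffer.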
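One common way around the constant-embedding problem described above is to build the dataset from placeholders and feed the NumPy arrays once, when the iterator is initialized. The following is a minimal sketch, not the asker's pipeline: the arrays are small random stand-ins, the batch size of 4 is arbitrary, and the `tf1` alias is an assumption added only so that the same graph-mode code runs on TensorFlow 1.x and on newer releases that expose `tf.compat.v1`.

```python
import numpy as np
import tensorflow as tf

# Assumption: use the v1 graph-mode API whether or not tf.compat.v1 exists.
tf1 = getattr(tf.compat, "v1", tf)

# Small stand-in arrays; real code would load the training data instead.
features = np.random.rand(8, 4).astype(np.float32)
labels = np.arange(8, dtype=np.int64)

with tf1.Graph().as_default():
    # Placeholders keep the array contents out of the GraphDef.
    features_ph = tf1.placeholder(features.dtype, features.shape)
    labels_ph = tf1.placeholder(labels.dtype, labels.shape)

    dataset = tf1.data.Dataset.from_tensor_slices((features_ph, labels_ph))
    dataset = dataset.batch(4)
    iterator = dataset.make_initializable_iterator()
    next_features, next_labels = iterator.get_next()

    with tf1.Session() as sess:
        # The arrays are fed once here, not on every training step.
        sess.run(iterator.initializer,
                 feed_dict={features_ph: features, labels_ph: labels})
        first_labels = sess.run(next_labels)

print(first_labels)
```

After initialization, each `sess.run` on `next_features` or `next_labels` pulls the next batch from the dataset without a per-step feed_dict.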

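But that does not seem to be the case here. So, based on the information you provided, I think the CPU time is mainly (or entirely) due to the operation that feeds the next input into the graph.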

Why are operations placed on the GPU also executed on the CPU (TensorFlow)?
