How to profile TensorFlow

This post explains in detail how to use TensorFlow's timeline module to profile a program, covering profiling a single run, merging multiple runs into one trace, and fixes for common problems.


TensorFlow is one of the most widely used machine learning libraries today. Profiling a TensorFlow graph to understand how much time each op consumes is very useful for improving program performance. This can be done with TensorFlow's timeline module, but there is no clear tutorial for it online, so in this post I will try to cover profiling a TensorFlow program through the following topics:

A simple example

Let's first get a feel for the timeline module with a very simple example, taken from a StackOverflow answer.

# https://github.com/ikhlestov/tensorflow_profiling/blob/master/01_simple_example.py

import tensorflow as tf
from tensorflow.python.client import timeline

a = tf.random_normal([200, 500])
b = tf.random_normal([500, 100])
res = tf.matmul(a, b)

with tf.Session() as sess:
    # add additional options to trace the session execution
    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(res, options=options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json file
    fetched_timeline = timeline.Timeline(run_metadata.step_stats)
    chrome_trace = fetched_timeline.generate_chrome_trace_format()
    with open('timeline_01.json', 'w') as f:
        f.write(chrome_trace)

You should have noticed that we passed the options and run_metadata arguments to session.run. After execution, a timeline_01.json file is produced (in Chrome trace format). If the script does not run on your machine, try the first fix in the "Possible problems during profiling and their fixes" section below.

The timeline module stores its data in the Chrome tracing format, so the stored data can only be viewed with the Chrome browser. Open Chrome and type chrome://tracing in the address bar. In the upper left you will see a Load button; use it to load the JSON file we just generated.
[Screenshot: the loaded trace in chrome://tracing]
At the top of the view above you will see a timeline measured in milliseconds. To see more information about an op, just click on the corresponding time slice. On the right side of the page there are also a few simple tools: selection, pan, zoom, and timing.
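For reference, a Chrome trace file is plain JSON: a traceEvents list in which each timed event carries a ts (start timestamp, in microseconds) and a dur (duration) field. A minimal hand-built sketch (the field names follow the Chrome Trace Event format; the op name and timings are invented for illustration):

```python
import json

# A minimal hand-built Chrome trace with a single complete event.
trace = {
    "traceEvents": [
        {
            "name": "MatMul",   # label shown in the chrome://tracing UI
            "ph": "X",          # "X" = a complete event (has ts and dur)
            "ts": 0,            # start time in microseconds
            "dur": 150,         # duration in microseconds
            "pid": 0,           # process lane in the UI
            "tid": 0,           # thread lane in the UI
        }
    ]
}

chrome_trace = json.dumps(trace)
with open("minimal_trace.json", "w") as f:
    f.write(chrome_trace)
# minimal_trace.json loads in chrome://tracing just like timeline_01.json.
```

This is the same structure that timeline.Timeline.generate_chrome_trace_format produces, just with a single artificial event.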

A slightly more complex example

Now let's examine a slightly more complex example:

# https://github.com/ikhlestov/tensorflow_profiling/blob/master/02_example_with_placeholders_and_for_loop.py

import os
import tempfile

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.python.client import timeline

batch_size = 100

# placeholder
inputs = tf.placeholder(tf.float32, [batch_size, 784])
targets = tf.placeholder(tf.float32, [batch_size, 10])

# model
fc_1_out = tf.layers.dense(inputs, 500, activation=tf.nn.sigmoid)
fc_2_out = tf.layers.dense(fc_1_out, 784, activation=tf.nn.sigmoid)
logits = tf.layers.dense(fc_2_out, 10, activation=None)

# loss
loss = tf.losses.softmax_cross_entropy(onehot_labels=targets, logits=logits)
# train_op
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)


if __name__ == '__main__':
    mnist_save_dir = os.path.join(tempfile.gettempdir(), 'MNIST_data')
    mnist = input_data.read_data_sets(mnist_save_dir, one_hot=True)

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())

        run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        for i in range(3):
            batch_input, batch_target = mnist.train.next_batch(batch_size)
            feed_dict = {inputs: batch_input,
                         targets: batch_target}

            sess.run(train_op,
                     feed_dict=feed_dict,
                     options=run_options,
                     run_metadata=run_metadata)

            fetched_timeline = timeline.Timeline(run_metadata.step_stats)
            chrome_trace = fetched_timeline.generate_chrome_trace_format()
            with open('timeline_02_step_%d.json' % i, 'w') as f:  # save each run's trace to its own JSON file
                f.write(chrome_trace)

In the example above, each tf.layers.dense call places its ops under its own name scope. Thanks to this, the timeline display is much clearer.

Our code also saves the traces of the 3 runs as three separate files. If we run this program on a CPU, we get timelines like the ones below (they may differ slightly from machine to machine):
[Screenshot: CPU timelines for the three runs]
If you run the same code on a GPU, you will find that the trace of the first run differs from the later ones:
[Screenshots: GPU timelines for the first and the subsequent runs]
You may notice that the first run on the GPU takes much longer than the subsequent ones. This happens because TensorFlow performs some GPU initialization during the first run, after which execution is optimized. If you want more accurate timelines, you should store a trace after roughly 100 runs.
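The "trace only after warm-up" advice can be wrapped in a tiny helper. This is a sketch of ours, not part of the TensorFlow API; the helper and parameter names are made up:

```python
def trace_this_step(step, warmup=100, n_traced=3):
    """Trace only a few steps after the warm-up phase.

    The first GPU runs are unrepresentative because of one-time
    initialization, so skip `warmup` steps and then trace `n_traced`
    consecutive steps.  (Helper and parameter names are ours, not
    part of TensorFlow.)
    """
    return warmup <= step < warmup + n_traced

# In the training loop one would then pass the trace options conditionally:
#   opts = run_options if trace_this_step(i) else None
#   meta = run_metadata if trace_this_step(i) else None
#   sess.run(train_op, feed_dict=feed_dict, options=opts, run_metadata=meta)
traced_steps = [s for s in range(200) if trace_this_step(s)]
```

With the defaults, only steps 100, 101, and 102 of a 200-step loop would be traced; all other steps run without the FULL_TRACE overhead.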

Moreover, all incoming/outgoing flows now start with the scope name, so we know exactly where each op lives in the source code.

Storing the timelines of multiple runs in a single file

For one reason or another, we may want to store the timelines of multiple session runs in a single file. How can we do that?

Unfortunately, this can only be done by hand. The Chrome trace format contains a definition for every event along with its timing. On the first run we save all of the data, but on subsequent runs we append only the timed events, not the definitions.

# https://github.com/ikhlestov/tensorflow_profiling/blob/master/03_merged_timeline_example.py

import os
import tempfile
import json

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.python.client import timeline


class TimeLiner:
    _timeline_dict = None

    def update_timeline(self, chrome_trace):
        # convert the chrome trace JSON string to a python dict
        chrome_trace_dict = json.loads(chrome_trace)
        # for first run store full trace
        if self._timeline_dict is None:
            self._timeline_dict = chrome_trace_dict
        # for later runs, append only the timed events, not the definitions
        else:
            for event in chrome_trace_dict['traceEvents']:
                # timed events carry a 'ts' (timestamp) field
                if 'ts' in event:
                    self._timeline_dict['traceEvents'].append(event)

    def save(self, f_name):
        with open(f_name, 'w') as f:
            json.dump(self._timeline_dict, f)


batch_size = 100

# placeholder
inputs = tf.placeholder(tf.float32, [batch_size, 784])
targets = tf.placeholder(tf.float32, [batch_size, 10])

# model
fc_1_out = tf.layers.dense(inputs, 500, activation=tf.nn.sigmoid)
fc_2_out = tf.layers.dense(fc_1_out, 784, activation=tf.nn.sigmoid)
logits = tf.layers.dense(fc_2_out, 10, activation=None)

# loss
loss = tf.losses.softmax_cross_entropy(onehot_labels=targets, logits=logits)
# train_op
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

if __name__ == '__main__':
    mnist_save_dir = os.path.join(tempfile.gettempdir(), 'MNIST_data')
    mnist = input_data.read_data_sets(mnist_save_dir, one_hot=True)

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())

        options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        many_runs_timeline = TimeLiner()
        runs = 5
        for i in range(runs):
            batch_input, batch_target = mnist.train.next_batch(batch_size)
            feed_dict = {inputs: batch_input,
                         targets: batch_target}

            sess.run(train_op,
                     feed_dict=feed_dict,
                     options=options,
                     run_metadata=run_metadata)

            fetched_timeline = timeline.Timeline(run_metadata.step_stats)
            chrome_trace = fetched_timeline.generate_chrome_trace_format()
            many_runs_timeline.update_timeline(chrome_trace)
        many_runs_timeline.save('timeline_03_merged_%d_runs.json' % runs)

We then get the merged timeline:
[Screenshot: the merged timeline]

Possible problems during profiling and their fixes

A few problems can come up during profiling. First of all, it may simply not work. If you run into the following error:

I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcupti.so.8.0. LD_LIBRARY_PATH:

you can fix it by installing libcupti-dev:

sudo apt-get install libcupti-dev

The second common problem is run latency. In the last picture we can see a gap between runs; for large networks the gap can be long. This cannot be fixed completely, but using a custom C++ protobuf library reduces the latency. This is described in the official TensorFlow documentation.

Why the gap between two runs appears: because we save each step's timeline serially in Python code, the gap is unavoidable. If each step's timeline were instead saved in parallel by TensorFlow's C++ engine, the gap would disappear entirely.
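A cheap mitigation (our suggestion, not from the original article) is to keep trace handling out of the hot loop: buffer the raw chrome_trace strings during training and merge and write them once after the loop finishes. A self-contained sketch of that merge, mirroring the TimeLiner logic above on hand-made traces:

```python
import json

def merge_traces(chrome_traces):
    """Merge several Chrome trace JSON strings into one dict.

    Same idea as the TimeLiner class above: keep the first trace whole,
    then append only the timed events (those with a 'ts' field) from
    later traces.
    """
    merged = None
    for raw in chrome_traces:
        trace = json.loads(raw)
        if merged is None:
            merged = trace
        else:
            merged["traceEvents"].extend(
                e for e in trace["traceEvents"] if "ts" in e)
    return merged

# During training we would only do: buffered.append(chrome_trace),
# and call merge_traces(buffered) once, after the loop.
buffered = [
    json.dumps({"traceEvents": [{"name": "a", "ph": "M"},
                                {"name": "b", "ts": 1, "dur": 2}]}),
    json.dumps({"traceEvents": [{"name": "a", "ph": "M"},
                                {"name": "b", "ts": 5, "dur": 2}]}),
]
merged = merge_traces(buffered)
```

This only moves the Python-side work out of the gap between sess.run calls; the serialization cost itself is unchanged.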

Conclusion

I hope the walkthrough above has given you a good grasp of TensorFlow profiling. All the code used in this post can be found in this repo.

Sources:
This article is a translation of https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d
Another translation of the same article: https://walsvid.github.io/2017/03/25/profiletensorflow/
