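torch.profiler

Overview

PyTorch Profiler is a tool that collects performance metrics during training and inference. Its context manager API can be used to better understand which model operators are the most expensive, examine their input shapes and stack traces, study device kernel activity, and visualize the execution trace.

Note

An earlier version of the API, in the torch.autograd module, is considered legacy and will be deprecated.

API Reference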

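class torch.profiler._KinetoProfile(*, activities=None, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, experimental_config=None, execution_trace_observer=None, acc_events=False, custom_trace_id_callback=None)

Low-level profiler that wraps the autograd profiler.

Parameters:

activities (iterable) – list of activity groups (CPU, CUDA) to use in profiling. Supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.XPU. Default value: ProfilerActivity.CPU and (when available) ProfilerActivity.CUDA or (when available) ProfilerActivity.XPU.
record_shapes (bool) – save information about operator input shapes.
profile_memory (bool) – track tensor memory allocation/deallocation (see export_memory_timeline for more details).
with_stack (bool) – record source information (file and line number) for the operators.
with_flops (bool) – use a formula to estimate the FLOPS of specific operators (matrix multiplication and 2D convolution).
with_modules (bool) – record the module hierarchy (including function names) corresponding to the call stack of the operator. For example, if module A's forward calls module B's forward, which contains an aten::add op, then the module hierarchy of aten::add is A.B. Note that this is currently supported only for TorchScript models, not for eager mode models.
experimental_config (_ExperimentalConfig) – a set of experimental options used by profiler libraries such as Kineto. Note that backward compatibility is not guaranteed.
execution_trace_observer (ExecutionTraceObserver) – a PyTorch Execution Trace Observer object. PyTorch Execution Traces offer a graph-based representation of AI/ML workloads and enable replay benchmarks, simulators, and emulators. When this argument is included, the observer's start() and stop() methods are called for the same time window as the PyTorch profiler.
acc_events (bool) – enable the accumulation of FunctionEvents across multiple profiling cycles.

Note

This API is experimental and subject to change in the future.

Enabling shape and stack tracing results in additional overhead. When record_shapes=True is specified, the profiler temporarily holds references to the tensors; this may further prevent certain optimizations that depend on the reference count and introduce extra tensor copies.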

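add_metadata(key, value)

Adds user-defined metadata with a string key and a string value to the trace file.

add_metadata_json(key, value)

Adds user-defined metadata with a string key and a valid JSON value to the trace file.

events(): Returns the list of unaggregated profiler events, to be used in the trace callback or after profiling has finished.

export_chrome_trace(path)

Exports the collected trace in Chrome JSON format. If Kineto is enabled, only the last cycle in the schedule is exported.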

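export_memory_timeline(path, device=None)

Exports memory event information from the profiler's collected tree for a given device, and exports a timeline plot. export_memory_timeline can produce 3 kinds of files, each selected by the suffix of path.

For an HTML-compatible plot, use the suffix .html; the memory timeline plot is embedded as a PNG file in the HTML file.
For plot points consisting of [times, [sizes by category]], where times are timestamps and sizes are the memory usage of each category, use the suffix .json or .json.gz; the memory timeline is saved as JSON or gzipped JSON accordingly.
For raw memory points, use the suffix .raw.json.gz. Each raw memory event consists of (timestamp, action, numbytes, category), where action is one of [PREEXISTING, CREATE, INCREMENT_VERSION, DESTROY] and category is one of the enums from torch.profiler._memory_profiler.Category.

Output: the memory timeline written as gzipped JSON, JSON, or HTML.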
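
The suffix rules above can be sketched in plain Python. The helper memory_timeline_format below is hypothetical (it is not part of torch.profiler) and only models the documented suffixes; the point it illustrates is that .raw.json.gz has to be tested before .json.gz, because every raw path also ends with .json.gz.

```python
def memory_timeline_format(path: str) -> str:
    """Classify an output path by its suffix (hypothetical helper, not a torch API)."""
    if path.endswith(".html"):
        return "html plot (PNG embedded)"
    # Test the more specific suffix first: ".raw.json.gz" also ends with ".json.gz".
    if path.endswith(".raw.json.gz"):
        return "raw events (gzipped JSON)"
    if path.endswith(".json.gz"):
        return "plot points (gzipped JSON)"
    if path.endswith(".json"):
        return "plot points (JSON)"
    # This sketch only covers the suffixes named in the documentation.
    return "unrecognized suffix"


for candidate in ("mem.html", "mem.json", "mem.json.gz", "mem.raw.json.gz"):
    print(f"{candidate:>16} -> {memory_timeline_format(candidate)}")
```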

export_stacks(path, metric='self_cpu_time_total')

Saves stack traces to a file.

Parameters:

path (str) – save the stacks file to this location;
metric (str) – the metric to use: "self_cpu_time_total" or "self_cuda_time_total"

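key_averages(group_by_input_shape=False, group_by_stack_n=0, group_by_overload_name=False)

Averages events, grouping them by operator name and (optionally) by input shapes, stack, and overload name.

Note

To use the shape/stack functionality, make sure to set record_shapes/with_stack when creating the profiler context manager.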
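
As a minimal sketch of the grouping described above, the following CPU-only snippet runs the same operator on two different input shapes and compares the default grouping with group_by_input_shape=True. The matrix sizes and variable names are arbitrary choices made for this illustration.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# record_shapes=True is needed so that events can later be grouped by input shape.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    torch.mm(torch.randn(8, 16), torch.randn(16, 4))
    torch.mm(torch.randn(32, 16), torch.randn(16, 4))

by_name = prof.key_averages()
by_shape = prof.key_averages(group_by_input_shape=True)

# Keep only the matrix-multiplication rows from each aggregation.
mm_by_name = [evt for evt in by_name if evt.key == "aten::mm"]
mm_by_shape = [evt for evt in by_shape if evt.key == "aten::mm"]

print(len(mm_by_name), len(mm_by_shape))
for evt in mm_by_shape:
    print(evt.key, evt.input_shapes, evt.count)
```

Grouping by name alone merges both calls into one aten::mm row, while grouping by input shape keeps one row per distinct shape.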

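preset_metadata_json(key, value)

Presets user-defined metadata while the profiler is not started; the metadata is added to the trace file later. The metadata takes the form of a string key and a valid JSON value.

toggle_collection_dynamic(enable, activities)

Toggles the collection of activities on or off at any point during collection. Currently supports toggling Torch Ops (CPU) and the CUDA activities supported by Kineto.

Parameters: activities (iterable) – list of activity groups to use in profiling. Supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA

Examples: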

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    code_to_profile_0()
    # turn off collection of all CUDA activity
    p.toggle_collection_dynamic(False, [torch.profiler.ProfilerActivity.CUDA])
    code_to_profile_1()
    # turn on collection of all CUDA activity
    p.toggle_collection_dynamic(True, [torch.profiler.ProfilerActivity.CUDA])
    code_to_profile_2()
print(p.key_averages().table(
    sort_by="self_cuda_time_total", row_limit=-1))

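class torch.profiler.profile(*, activities=None, schedule=None, on_trace_ready=None, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, experimental_config=None, execution_trace_observer=None, acc_events=False, use_cuda=None, custom_trace_id_callback=None)

Profiler context manager.

Parameters:

activities (iterable) – list of activity groups (CPU, CUDA) to use in profiling. Supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.XPU. Default value: ProfilerActivity.CPU and (when available) ProfilerActivity.CUDA or (when available) ProfilerActivity.XPU.
schedule (Callable) – callable that takes the step (int) as its single parameter and returns the ProfilerAction value that specifies the profiler action to perform at each step.
on_trace_ready (Callable) – callable that is called at each step when schedule returns ProfilerAction.RECORD_AND_SAVE.
record_shapes (bool) – save information about operator input shapes.
profile_memory (bool) – track tensor memory allocation/deallocation.
with_stack (bool) – record source information (file and line number) for the operators.
with_flops (bool) – use a formula to estimate the FLOPs (floating point operations) of specific operators (matrix multiplication and 2D convolution).
with_modules (bool) – record the module hierarchy (including function names) corresponding to the call stack of the operator. For example, if module A's forward calls module B's forward, which contains an aten::add op, then the module hierarchy of aten::add is A.B. Note that this is currently supported only for TorchScript models, not for eager mode models.
experimental_config (_ExperimentalConfig) – a set of experimental options used for Kineto library features. Note that backward compatibility is not guaranteed.
execution_trace_observer (ExecutionTraceObserver) – a PyTorch Execution Trace Observer object. PyTorch Execution Traces offer a graph-based representation of AI/ML workloads and enable replay benchmarks, simulators, and emulators. When this argument is included, the observer's start() and stop() are called for the same time window as the PyTorch profiler. See the examples section below for a code sample.
acc_events (bool) – enable the accumulation of FunctionEvents across multiple profiling cycles.
use_cuda (bool) – Deprecated since version 1.8.1: use activities instead.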

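Note

Use schedule() to generate the callable schedule. Non-default schedules are useful when profiling long training jobs and allow the user to obtain multiple traces at different iterations of the training process. The default schedule simply records all events continuously for the duration of the context manager.

Note

Use tensorboard_trace_handler() to generate result files for TensorBoard:

on_trace_ready=torch.profiler.tensorboard_trace_handler(dir_name)

After profiling, the result files can be found in the specified directory. Use the command:

tensorboard --logdir dir_name

to see the results in TensorBoard. For more information, see the PyTorch Profiler TensorBoard Plugin.

Note

Enabling shape and stack tracing results in additional overhead. When record_shapes=True is specified, the profiler temporarily holds references to the tensors; this may further prevent certain optimizations that depend on the reference count and introduce extra tensor copies.

Examples: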

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    code_to_profile()
print(p.key_averages().table(
    sort_by="self_cuda_time_total", row_limit=-1))

Using the profiler's schedule, on_trace_ready and step functions:

# Non-default profiler schedule allows user to turn profiler on and off
# on different iterations of the training loop;
# trace_handler is called every time a new trace becomes available
def trace_handler(prof):
    print(prof.key_averages().table(
        sort_by="self_cuda_time_total", row_limit=-1))
    # prof.export_chrome_trace("/tmp/test_trace_" + str(prof.step_num) + ".json")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],

    # In this example with wait=1, warmup=1, active=2, repeat=1,
    # profiler will skip the first step/iteration,
    # start warming up on the second, record
    # the third and the fourth iterations,
    # after which the trace will become available
    # and on_trace_ready (when set) is called;
    # the cycle repeats starting with the next step

    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2,
        repeat=1),
    on_trace_ready=trace_handler
    # on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
    # used when outputting for tensorboard
    ) as p:
        for iter in range(N):
            code_iteration_to_profile(iter)
            # send a signal to the profiler that the next iteration has started
            p.step()

The following sample shows how to set up an Execution Trace Observer (execution_trace_observer):

with torch.profiler.profile(
    ...
    execution_trace_observer=(
        ExecutionTraceObserver().register_callback("./execution_trace.json")
    ),
) as p:
    for iter in range(N):
        code_iteration_to_profile(iter)
        p.step()

You can also refer to test_execution_trace_with_kineto() in tests/profiler/test_profiler.py. Note: any object that satisfies the _ITraceObserver interface can also be passed.

get_trace_id(): Returns the current trace ID.

set_custom_trace_id_callback(callback): Sets a callback to be called when a new trace ID is generated.

step(): Signals the profiler that the next profiling step has started.

class torch.profiler.ProfilerAction(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None): Profiler actions that can be taken at the specified intervals.

class torch.profiler.ProfilerActivity

Members:

CPU

XPU

MTIA

CUDA

HPU

PrivateUse1

property name

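torch.profiler.schedule(*, wait, warmup, active, repeat=0, skip_first=0, skip_first_wait=0)

Returns a callable that can be used as the profiler's schedule argument. The profiler skips the first skip_first steps, then waits for wait steps, then warms up for the next warmup steps, then records for the next active steps, and then repeats the cycle starting with the wait steps. The optional number of cycles is specified with the repeat parameter; a value of zero means that the cycles continue until profiling is finished.

The skip_first_wait parameter controls whether the first wait stage is skipped. This can be useful if the user wants to wait longer than skip_first between cycles, but not before the first profile. For example, if skip_first is 10 and wait is 20, the first cycle waits 10 + 20 = 30 steps before warmup if skip_first_wait is zero, but only 10 steps if skip_first_wait is non-zero. All subsequent cycles then wait 20 steps between the last active step and warmup.

Return type: Callable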
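
The step arithmetic described above can be checked with a small pure-Python model. This is a simplified sketch written from the description, not the torch implementation: phase_at and first_warmup are hypothetical helpers, and the phase names are plain strings that mirror the ProfilerAction member names.

```python
def phase_at(step, *, wait, warmup, active, repeat=0, skip_first=0, skip_first_wait=0):
    """Return the phase name for a step (simplified model, not the torch code)."""
    if step < skip_first:
        return "NONE"
    step -= skip_first
    if skip_first_wait != 0:
        # Treat the first wait stage as if it had already elapsed.
        step += wait
    cycle = wait + warmup + active
    if repeat > 0 and step // cycle >= repeat:
        return "NONE"  # all requested cycles have completed
    pos = step % cycle
    if pos < wait:
        return "NONE"
    if pos < wait + warmup:
        return "WARMUP"
    # The last active step of a cycle is the one that makes the trace available.
    return "RECORD_AND_SAVE" if pos == cycle - 1 else "RECORD"


def first_warmup(**kwargs):
    """Index of the first step whose phase is WARMUP (searches the first 1000 steps)."""
    return next(s for s in range(1000) if phase_at(s, **kwargs) == "WARMUP")


print([phase_at(s, wait=1, warmup=1, active=2, repeat=1) for s in range(6)])
print(first_warmup(wait=20, warmup=1, active=2, skip_first=10))
print(first_warmup(wait=20, warmup=1, active=2, skip_first=10, skip_first_wait=1))
```

Under this model the first warmup step is step 30 when skip_first_wait is zero and step 10 when it is non-zero, matching the 10 + 20 = 30 example in the text.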

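torch.profiler.tensorboard_trace_handler(dir_name, worker_name=None, use_gzip=False)

Outputs trace files to the directory dir_name, which can then be passed directly to tensorboard as logdir. In a distributed scenario, worker_name should be unique for each worker; by default it is set to '[hostname]_[pid]'.

Intel Instrumentation and Tracing Technology APIs

torch.profiler.itt.is_available(): Checks whether the ITT feature is available.

torch.profiler.itt.mark(msg)

Describes an instantaneous event that occurred at some point.

Parameters: msg (str) – ASCII message to associate with the event.

torch.profiler.itt.range_push(msg)

Pushes a range onto a stack of nested range spans. Returns the zero-based depth of the range that is started.

Parameters: msg (str) – ASCII message to associate with the range.

torch.profiler.itt.range_pop(): Pops a range off a stack of nested range spans. Returns the zero-based depth of the range that is ended.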
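
A minimal sketch of how these calls might be combined, assuming a PyTorch build that may or may not include ITT support. The function annotated_matmul and the message strings are made up for this illustration, and the snippet does not rely on the return values of range_push or range_pop.

```python
import torch
from torch.profiler import itt


def annotated_matmul(n: int = 64) -> torch.Tensor:
    """Multiply two identity matrices, annotating the work with ITT when possible."""
    use_itt = itt.is_available()  # skip all ITT calls on builds without ITT support
    if use_itt:
        itt.mark("annotated_matmul: start")  # instantaneous event
        itt.range_push("matmul")             # open a nested range
    try:
        result = torch.eye(n) @ torch.eye(n)
    finally:
        if use_itt:
            itt.range_pop()                  # close the range opened above
    return result


out = annotated_matmul()
print(itt.is_available(), tuple(out.shape))
```

Pairing range_push with range_pop in a try/finally block keeps the range stack balanced even if the annotated code raises.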
