Profiling to understand torch.compile performance

What to use torch.profiler for

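torch.profiler helps you understand the performance of your program at kernel-level granularity - for example, it can show graph breaks and resource utilization at the level of the whole program. The data it provides can often point you to where further investigation of model performance is needed.

To dig into the performance of individual kernels, other tools can be used, such as NVIDIA Nsight Compute, AMD Omnitrace, Intel® VTune™ Profiler, or Inductor's profiling tools.

See also the general PyTorch profiler guide.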

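Basics of using torch.profiler and viewing traces

Example program: we will use this example of profiling resnet18. Notice the following parts of the example program:

Include a warm-up run to wait for compilation to complete (this also warms up systems such as the CUDA caching allocator)
Use the torch.profiler.profile() context to profile the section we are interested in
Use prof.export_chrome_trace("trace.json") to export the profiling result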

import torch
from torchvision.models import resnet18

device = 'cuda'      # or 'cpu', 'xpu', etc.
model = resnet18().to(device)

inputs = [torch.randn((5, 3, 224, 224), device=device) for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

# warm up
fwd_bwd(inputs[0])

with torch.profiler.profile() as prof:
    for i in range(1, 4):
        fwd_bwd(inputs[i])
        prof.step()

prof.export_chrome_trace("trace.json")

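Viewing Chrome traces: in the Chrome browser, open chrome://tracing and load the json file. Use the "w" and "s" keys to zoom in and out, and the "a" and "d" keys to scroll left and right. "?" shows a "help" screen with a list of shortcuts.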

Example of a basic chrome trace, visualized in the chrome://tracing viewer

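Here, we observe:

CompiledFunction and CompiledFunctionBackward events, which correspond to the Dynamo-compiled regions.
CPU events at the top and GPU events at the bottom.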

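Flows between CPU and accelerator events

Every kernel on the accelerator runs after being launched by code running on the CPU. The profiler can draw connections (i.e. "flows") between accelerator and CPU events to show which CPU event launched a given accelerator kernel. This is particularly helpful because, with a few exceptions, accelerator kernels are launched asynchronously.

To view a flow connection, click on a GPU kernel and then click "ac2g":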

Visualization in the chrome://trace viewer, showing an async flow between a kernel and its launching location.

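Alternatively, turn on all flows with the "Flow events" dropdown at the top.

Working around CUDA graph profiling issues

When CUDA graphs are enabled, some CUDA configurations (driver version < 525.85.12 or CUDA < 12) can run into issues between the profiling tools and CUDA graphs. To fix these issues, add an empty profiling context at the top of your program: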

import torch

torch.profiler._utils._init_for_cuda_graphs()

# ... rest of program

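Understanding compilation time

To understand why compilation is taking a long time, you can profile the first invocation of a torch.compile-ed program. Keep in mind that profile traces of compilation can be more distorted than typical profiles, because compilation workloads can differ substantially from typical PyTorch workloads. In some cases, trace files may also be quite large. Traces larger than 1GB can be difficult to open with the Chrome tracing tool.

Note: roughly the same information can also be obtained in non-graphical form with torch._dynamo.utils.compile_times(). This utility does not show when the compilation steps occur, but it does show the amount of time spent on each step, and those times are not affected by any profiling overhead.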
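As a rough illustration of the non-graphical route, the sketch below compiles a small function and prints the report from torch._dynamo.utils.compile_times(). The toy function and the choice of backend="eager" are assumptions made only so that the sketch runs without a GPU or a C++ toolchain; with the default Inductor backend the report would typically list more steps.

```python
import torch
import torch._dynamo
import torch._dynamo.utils

def fn(x):
    # Toy function used only for illustration.
    return x.sin().relu()

# Assumption: backend="eager" keeps the sketch lightweight; drop it to
# measure the default Inductor backend instead.
fn_c = torch.compile(fn, backend="eager")
fn_c(torch.rand(2, 2))  # the first call triggers compilation

# With the default arguments the metrics are returned as a formatted string.
report = torch._dynamo.utils.compile_times()
print(report)
```

The exact rows in the report depend on the PyTorch version and backend, so treat it as a quick overview and not as a stable format.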

The following example profiles the first (compiling) invocation:

import torch
from torchvision.models import resnet18

# user can switch between cuda and xpu
device = 'cuda'
model = resnet18().to(device)
inputs = [torch.randn((5, 3, 224, 224), device=device) for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

def warmup_compile():
    def fn(x):
        return x.sin().relu()

    x = torch.rand((2, 2), device=device, requires_grad=True)
    fn_c = torch.compile(fn)
    out = fn_c(x)
    out.sum().backward()

with torch.profiler.profile() as prof:
    with torch.profiler.record_function("warmup compile"):
        warmup_compile()

    with torch.profiler.record_function("resnet18 compile"):
        fwd_bwd(inputs[0])

prof.export_chrome_trace("trace_compile.json")

A visualization in the chrome://trace viewer, showing dynamo and inductor compilation steps

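Note a few things:

The first invocation should occur during profiling so that compilation is captured
Add a warm-up compilation to initialize any systems that are lazily initialized

Finding graph breaks: "Torch-Compiled Region" and "CompiledFunction"

Although there are logging tools for identifying graph breaks, the profiler provides a quick visual way to identify them. There are two profiler events to look for: Torch-Compiled Region and CompiledFunction.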

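Torch-Compiled Region - introduced in PyTorch 2.2 - is a profiler event that covers the entire compiled region. Graph breaks almost always look the same: nested "Torch-Compiled Region" events.

If you run two separate functions with torch.compile() applied independently to each of them, you should generally expect to see two adjacent (i.e. NOT stacked/nested) Torch-Compiled Region events. Meanwhile, if you encounter graph breaks (or disable()'ed/skipped regions), expect nested "Torch-Compiled Region" events.

CompiledFunction - introduced in PyTorch 2.0 - is a profiler event that appears when gradients are required for any inputs. Each graph break interrupts a CompiledFunction block, splitting it in two. CompiledFunction events only appear when Autograd is involved, i.e. when some of the input tensors to the graph have requires_grad=True.

When a CompiledFunction appears in a trace, it is typically paired with a CompiledFunctionBackward event in the backward pass. If the backward function is called, a "fwd-bwd link" connecting the two should appear in the trace.

If your use case includes a graph that does not require grad and does not include "Torch-Compiled Region" events, it can be harder to tell whether torch.compile is being applied correctly. One clue can be the existence of Inductor-generated Triton kernels.

See the synthetic example below for a demonstration: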

import torch
import torch._dynamo
# user can switch between cuda and xpu
device = 'cuda'

class ModelWithBreaks(torch.nn.Module):
    def __init__(self):
        super().__init__()
        def create_sequential():
            return torch.nn.Sequential(
                torch.nn.Linear(128, 128),
                torch.nn.ReLU(),
                torch.nn.Linear(128, 128),
                torch.nn.ReLU(),
            )
        self.mod1 = create_sequential()
        self.mod2 = create_sequential()
        self.mod3 = create_sequential()
        self.mod4 = create_sequential()

    def forward(self, inp):
        mod1 = self.mod1(inp)
        torch._dynamo.graph_break()
        mod2 = self.mod2(mod1)
        torch._dynamo.graph_break()
        mod3 = self.mod3(mod2)
        torch._dynamo.graph_break()
        mod4 = self.mod4(mod3)
        return mod4

model = ModelWithBreaks().to(device)
inputs = [torch.randn((128, 128), device=device) for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

# warm up
fwd_bwd(inputs[0])

with torch.profiler.profile() as prof:
    for i in range(1, 4):
        fwd_bwd(inputs[i])
        prof.step()

prof.export_chrome_trace("trace_break.json")

Visualization in the chrome://trace viewer, showing nested Torch-Compiled Region events and multiple CompiledFunction events - indicating graph breaks.
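The exported trace can also be scanned programmatically for these events. The sketch below is a minimal example that assumes the file follows the Chrome trace layout (a top-level "traceEvents" list whose entries have a "name" field) and that the event names start with the strings discussed above; the helper names and the sample data are hypothetical.

```python
import json
from collections import Counter

# Prefixes to look for. "CompiledFunctionBackward" comes before
# "CompiledFunction" so that backward events are not counted as forward ones.
PREFIXES = ("Torch-Compiled Region", "CompiledFunctionBackward", "CompiledFunction")

def count_compile_events(trace):
    """Count trace events whose name starts with one of PREFIXES."""
    counts = Counter()
    for event in trace.get("traceEvents", []):
        name = event.get("name", "")
        for prefix in PREFIXES:
            if name.startswith(prefix):
                counts[prefix] += 1
                break
    return counts

def count_in_file(path):
    """Load an exported trace (e.g. "trace_break.json") and count its events."""
    with open(path) as f:
        return count_compile_events(json.load(f))

# Tiny hand-written stand-in for a real trace (hypothetical data).
sample_trace = {"traceEvents": [
    {"name": "Torch-Compiled Region", "ph": "X"},
    {"name": "Torch-Compiled Region", "ph": "X"},
    {"name": "CompiledFunction", "ph": "X"},
    {"name": "CompiledFunctionBackward", "ph": "X"},
    {"name": "aten::addmm", "ph": "X"},
]}
sample_counts = count_compile_events(sample_trace)
```

Counts alone do not reveal nesting, so this is only a quick check; the viewer remains the easiest way to see whether regions are stacked.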

Operator kernels

When an operator is launched, we expect to see a few events:

CPU-side event
Kernel launch (if dealing with a GPU kernel)
GPU-side event

Visualization in the chrome://trace viewer, showing the three types of events: CPU-side event, kernel launch, and GPU-side event

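Inductor-generated Triton kernels:

The CPU-side event should appear as an event prefixed with "triton_". These events currently carry minimal information - the kernel name and a launch - which is less than typical aten kernel launches provide (those include input shapes, types, etc.)
The kernel launch should appear as cuLaunchKernel instead of cudaLaunchKernel (cudaLaunchKernel is typical for aten ops)
The GPU-side event should appear; how descriptive its name is depends on the Inductor config option unique_kernel_names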

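Non-Inductor-generated Triton kernels:

The CPU-side event may not appear in traces; the machinery for automatically inserting a profiler event is currently implemented at the Inductor level, so Triton kernels that bypass Inductor may not appear in traces unless users annotate them manually
The kernel launch should appear as cuLaunchKernel instead of cudaLaunchKernel (cudaLaunchKernel is typical for aten ops)
The GPU-side event should appear, with a name similar to that of the Triton kernel as authored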
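Manual annotation, as mentioned above, can be done with torch.profiler.record_function. In the sketch below, launch_custom_kernel is a hypothetical stand-in for a hand-written Triton kernel launch, and profiling is limited to CPU activity only so that the example runs on any machine.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

def launch_custom_kernel(x):
    # Hypothetical stand-in for launching a hand-written Triton kernel.
    return x * 2

x = torch.rand(8)

# Assumption: CPU activity only, so the sketch runs without a GPU.
# Omit the argument to use the profiler's default activities.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    # The label given here becomes the CPU-side event name in the trace.
    with record_function("my_custom_triton_kernel"):
        y = launch_custom_kernel(x)

event_names = {evt.name for evt in prof.events()}
```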

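Inductor-generated CPU kernels:

The CPU-side event will not appear in traces; profiling for these has not been added yet
There are no kernel-launch or GPU-side events

Non-Triton kernels (i.e. aten kernels or custom ops) can also be expected to appear in traces from time to time. Sometimes, Inductor falls back to the original op implementation, in which case you will see a call to the aten op.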

Launch overhead

One common issue is poor GPU utilization. A quick way to identify it is to look for large gaps between kernels on the GPU:

Visualization in the chrome://trace viewer, showing large gaps between GPU kernels. This indicates that the model is CPU bound, likely due to overhead during kernel launches.

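This is often the result of CPU overhead, e.g. when the time spent on the CPU between kernel launches is larger than the time the GPU spends processing the kernels. The issue is more common for small batch sizes.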
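The gaps can also be estimated from an exported trace. The sketch below is a rough heuristic and not an official tool: it assumes that GPU kernels appear as complete events ("ph" equal to "X") with category "kernel" and with "ts" and "dur" in microseconds, and it treats all streams as a single timeline. The helper name and sample data are hypothetical.

```python
def kernel_gaps(trace, min_gap_us=0.0):
    """Return idle gaps (in microseconds) between consecutive GPU kernel events."""
    kernels = sorted(
        (
            e for e in trace.get("traceEvents", [])
            if e.get("ph") == "X" and str(e.get("cat", "")).lower() == "kernel"
        ),
        key=lambda e: e["ts"],
    )
    gaps = []
    prev_end = None
    for e in kernels:
        if prev_end is not None:
            gap = e["ts"] - prev_end
            if gap > min_gap_us:
                gaps.append(gap)
        end = e["ts"] + e["dur"]
        # Keep the latest end time seen so overlapping kernels are handled.
        prev_end = end if prev_end is None else max(prev_end, end)
    return gaps

# Hand-written sample data (hypothetical), times in microseconds.
sample_trace = {"traceEvents": [
    {"name": "kernel_a", "cat": "kernel", "ph": "X", "ts": 0, "dur": 10},
    {"name": "aten::mm", "cat": "cpu_op", "ph": "X", "ts": 5, "dur": 40},
    {"name": "kernel_b", "cat": "kernel", "ph": "X", "ts": 50, "dur": 5},
    {"name": "kernel_c", "cat": "kernel", "ph": "X", "ts": 56, "dur": 4},
]}
sample_gaps = kernel_gaps(sample_trace)
```

Comparing the total gap time with the total kernel time gives a rough sense of how launch-bound the workload is.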

When using Inductor, enabling CUDA graphs can often help improve performance if launch overhead is a concern.
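One way to request CUDA graphs is the "reduce-overhead" compile mode. The sketch below is a minimal example under that assumption; the small Linear model is illustrative, and the compiled model is only invoked when a CUDA device is available.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 128).to(device)

# mode="reduce-overhead" asks Inductor to use CUDA graphs where it can.
# Compilation is lazy, so nothing is compiled until the first call.
model_c = torch.compile(model, mode="reduce-overhead")

if device == "cuda":
    x = torch.randn(8, 128, device=device)
    for _ in range(3):
        # The first calls compile and record; later calls replay the graph.
        model_c(x)
```

After enabling it, re-profile and check whether the gaps between GPU kernels have shrunk; CUDA graphs trade some extra memory for lower launch overhead.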