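CUDA semantics

torch.cuda is used to set up and run CUDA operations. It keeps track of the currently selected GPU, and all CUDA tensors you allocate are created on that device by default. The selected device can be changed with a torch.cuda.device context manager.

However, once a tensor is allocated, you can operate on it irrespective of the selected device, and the results are always placed on the same device as the tensor.

Cross-GPU operations are not allowed by default, with the exception of copy_() and other methods with copy-like functionality such as to() and cuda(). Unless you enable peer-to-peer memory access, any attempt to launch ops on tensors spread across different devices will raise an error.

Below is a small example showcasing this: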

cuda = torch.device('cuda')     # Default CUDA device
cuda0 = torch.device('cuda:0')
cuda2 = torch.device('cuda:2')  # GPU 2 (these are 0-indexed)

x = torch.tensor([1., 2.], device=cuda0)
# x.device is device(type='cuda', index=0)
y = torch.tensor([1., 2.]).cuda()
# y.device is device(type='cuda', index=0)

with torch.cuda.device(1):
    # allocates a tensor on GPU 1
    a = torch.tensor([1., 2.], device=cuda)

    # transfers a tensor from CPU to GPU 1
    b = torch.tensor([1., 2.]).cuda()
    # a.device and b.device are device(type='cuda', index=1)

    # You can also use ``Tensor.to`` to transfer a tensor:
    b2 = torch.tensor([1., 2.]).to(device=cuda)
    # b.device and b2.device are device(type='cuda', index=1)

    c = a + b
    # c.device is device(type='cuda', index=1)

    z = x + y
    # z.device is device(type='cuda', index=0)

    # even within a context, you can specify the device
    # (or give a GPU index to the .cuda call)
    d = torch.randn(2, device=cuda2)
    e = torch.randn(2).to(cuda2)
    f = torch.randn(2).cuda(cuda2)
    # d.device, e.device, and f.device are all device(type='cuda', index=2)

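TensorFloat-32 (TF32) on Ampere (and later) devices

Starting in PyTorch 1.7, there is a flag called allow_tf32. This flag defaults to True in PyTorch 1.7 to PyTorch 1.11, and to False in PyTorch 1.12 and later. It controls whether PyTorch is allowed to use the TensorFloat32 (TF32) tensor cores, available on NVIDIA GPUs since Ampere, internally to compute matmuls (matrix multiplies and batched matrix multiplies) and convolutions.

TF32 tensor cores are designed to achieve better performance on matmuls and convolutions on torch.float32 tensors by rounding input data to 10 bits of mantissa and accumulating results with FP32 precision, maintaining FP32 dynamic range.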
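To make the "10 bits of mantissa" statement concrete, the following is a minimal sketch that emulates the input rounding in pure Python. The helper name round_to_tf32 is hypothetical, and the rounding rule (round half up in magnitude) is an assumption; real hardware may round differently, and special values such as NaN or infinity are not handled.

```python
import struct


def round_to_tf32(x: float) -> float:
    """Approximate TF32 input rounding: keep 10 of float32's 23 mantissa bits.

    Illustrative only; assumes a finite value that fits in float32.
    """
    # Reinterpret the value as its IEEE-754 float32 bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the dropped range (the 13 low mantissa bits), then clear them.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# 1 + 2**-10 needs exactly 10 mantissa bits, so it survives the rounding.
print(round_to_tf32(1.0 + 2**-10))
# 1 + 2**-12 is less than half a TF32 step away from 1.0, so it collapses to 1.0.
print(round_to_tf32(1.0 + 2**-12))
```

The exponent field is untouched, which is why the dynamic range stays the same as FP32 while the per-element precision drops.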

Matmuls and convolutions are controlled separately, and their corresponding flags can be accessed at:

# The flag below controls whether to allow TF32 on matmul. This flag defaults to False
# in PyTorch 1.12 and later.
torch.backends.cuda.matmul.allow_tf32 = True

# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True

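The precision of matmuls can also be set more broadly (not only for CUDA) via torch.set_float32_matmul_precision(). Note that besides matmuls and convolutions themselves, functions and nn modules that internally use matmuls or convolutions are also affected. These include nn.Linear, nn.Conv*, cdist, tensordot, affine grid and grid sample, adaptive log softmax, GRU and LSTM.

To get an idea of the precision and speed, see the example code and benchmark data (on A100) below: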

a_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
b_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
ab_full = a_full @ b_full
mean = ab_full.abs().mean()  # 80.7277

a = a_full.float()
b = b_full.float()

# Do matmul at TF32 mode.
torch.backends.cuda.matmul.allow_tf32 = True
ab_tf32 = a @ b  # takes 0.016s on GA100
error = (ab_tf32 - ab_full).abs().max()  # 0.1747
relative_error = error / mean  # 0.0022

# Do matmul with TF32 disabled.
torch.backends.cuda.matmul.allow_tf32 = False
ab_fp32 = a @ b  # takes 0.11s on GA100
error = (ab_fp32 - ab_full).abs().max()  # 0.0031
relative_error = error / mean  # 0.000039

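From the example above, we can see that with TF32 enabled the matmul is about 7x faster on A100, and that the relative error compared to double precision is approximately two orders of magnitude larger. Note that the exact ratio of TF32 to single-precision speed depends on the hardware generation, because properties such as the ratio of memory bandwidth to compute and the ratio of TF32 to FP32 matmul throughput may vary from generation to generation or from model to model. If full FP32 precision is needed, users can disable TF32 with: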

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
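The speed and accuracy trade-off can be checked with a little arithmetic. The sketch below only reuses the timings and relative errors quoted in the comments of the benchmark above; the variable names are illustrative.

```python
# Numbers copied from the comments of the A100 benchmark above.
tf32_time_s, fp32_time_s = 0.016, 0.11
tf32_rel_err, fp32_rel_err = 0.0022, 0.000039

# How much faster the TF32 matmul ran than the FP32 one.
speedup = fp32_time_s / tf32_time_s
# How much larger the TF32 relative error is than the FP32 one.
error_ratio = tf32_rel_err / fp32_rel_err

print(f"TF32 speedup: {speedup:.1f}x")
print(f"relative error ratio: {error_ratio:.0f}x")
```

A speedup just under 7x and an error ratio of a few tens of times match the "about 7x" and "about two orders of magnitude" statements only roughly, which is a reminder that these are ballpark figures for one GPU.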

To toggle the TF32 flags off in C++, you can do:

at::globalContext().setAllowTF32CuBLAS(false);
at::globalContext().setAllowTF32CuDNN(false);

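For more information about TF32, see NVIDIA's TensorFloat-32 and Ampere architecture documentation.

Reduced Precision Reduction in FP16 GEMMs

(Distinct from full FP16 accumulation, which is intended for hardware that has higher throughput with FP16 accumulation than with FP32 accumulation; see Full FP16 Accumulation in FP16 GEMMs.)

fp16 GEMMs are potentially done with some intermediate reduced-precision reductions (e.g., in fp16 rather than fp32). These selective reductions in precision can allow higher performance on certain workloads (particularly those with a large k dimension) and GPU architectures, at the cost of numerical precision and a potential for overflow.

Some example benchmark data on V100: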

[--------------------------- bench_gemm_transformer --------------------------]
      [  m ,  k  ,  n  ]    |  allow_fp16_reduc=True  |  allow_fp16_reduc=False
1 threads: --------------------------------------------------------------------
      [4096, 4048, 4096]    |           1634.6        |           1639.8
      [4096, 4056, 4096]    |           1670.8        |           1661.9
      [4096, 4080, 4096]    |           1664.2        |           1658.3
      [4096, 4096, 4096]    |           1639.4        |           1651.0
      [4096, 4104, 4096]    |           1677.4        |           1674.9
      [4096, 4128, 4096]    |           1655.7        |           1646.0
      [4096, 4144, 4096]    |           1796.8        |           2519.6
      [4096, 5096, 4096]    |           2094.6        |           3190.0
      [4096, 5104, 4096]    |           2144.0        |           2663.5
      [4096, 5112, 4096]    |           2149.1        |           2766.9
      [4096, 5120, 4096]    |           2142.8        |           2631.0
      [4096, 9728, 4096]    |           3875.1        |           5779.8
      [4096, 16384, 4096]   |           6182.9        |           9656.5
(times in microseconds).
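To see where the flag matters, the following sketch computes the speedup for a few rows copied from the table above. It is plain arithmetic on the published timings, not a new measurement.

```python
# (m, k, n) -> (time with allow_fp16_reduc=True, time with allow_fp16_reduc=False),
# in microseconds, copied from a few rows of the V100 table above.
rows = {
    (4096, 4096, 4096): (1639.4, 1651.0),
    (4096, 4144, 4096): (1796.8, 2519.6),
    (4096, 9728, 4096): (3875.1, 5779.8),
    (4096, 16384, 4096): (6182.9, 9656.5),
}

# Speedup gained by allowing reduced-precision reductions.
speedups = {shape: slow / fast for shape, (fast, slow) in rows.items()}

for (m, k, n), s in speedups.items():
    print(f"k={k:>5}: {s:.2f}x")
```

For k = 4096 the two settings are essentially tied, while for the larger k values the reduced-precision path is roughly 1.4x to 1.6x faster on this hardware.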

If full-precision reductions are needed, users can disable reduced-precision reductions in fp16 GEMMs with:

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

To toggle the reduced-precision reduction flag in C++, you can do:

at::globalContext().setAllowFP16ReductionCuBLAS(false);

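Reduced Precision Reduction in BF16 GEMMs

A similar flag (as above) exists for BFloat16 GEMMs. Note that this switch is set to True by default for BF16. If you observe numerical instability in your workload, you may wish to set it to False.

If reduced-precision reductions are not desired, users can disable them in bf16 GEMMs with: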

torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

To toggle the reduced-precision reduction flag in C++, you can do:

at::globalContext().setAllowBF16ReductionCuBLAS(true);

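Full FP16 Accumulation in FP16 GEMMs

Certain GPUs have increased performance when doing all FP16 GEMM accumulation in FP16, at the cost of numerical precision and a greater likelihood of overflow. Note that this setting only has an effect on GPUs of compute capability 7.0 (Volta) or newer.

This behavior can be enabled via: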

torch.backends.cuda.matmul.allow_fp16_accumulation = True

To toggle the FP16 accumulation flag in C++, you can do:

at::globalContext().setAllowFP16AccumulationCuBLAS(true);

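Asynchronous execution

By default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later. This allows more computations to run in parallel, including operations on the CPU or on other GPUs.

In general, the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs the necessary synchronization when copying data between CPU and GPU or between two GPUs. Hence, computation proceeds as if every operation were executed synchronously.

You can force synchronous computation by setting the environment variable CUDA_LAUNCH_BLOCKING=1. This can be handy when an error occurs on the GPU. (With asynchronous execution, such an error isn't reported until after the operation is actually executed, so the stack trace does not show where it was requested.)

A consequence of asynchronous computation is that time measurements without synchronization are not accurate. To get precise measurements, either call torch.cuda.synchronize() before measuring, or use torch.cuda.Event to record times as follows: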

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()

# Run some things here

end_event.record()
torch.cuda.synchronize()  # Wait for the events to be recorded!
elapsed_time_ms = start_event.elapsed_time(end_event)

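As an exception, several functions such as to() and copy_() accept an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary. Another exception is CUDA streams, explained below.

CUDA streams

A CUDA stream is a linear sequence of execution that belongs to a specific device. You normally do not need to create one explicitly: by default, each device uses its own "default" stream.

Operations inside each stream are serialized in the order they are created, but operations from different streams can execute concurrently in any relative order, unless explicit synchronization functions (such as synchronize() or wait_stream()) are used. For example, the following code is incorrect: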

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # sum() may start execution before normal_() finishes!
    B = torch.sum(A)

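When the "current stream" is the default stream, PyTorch automatically performs the necessary synchronization when data is moved around, as explained above. However, when using non-default streams, it is the user's responsibility to ensure proper synchronization. The fixed version of this example is: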

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
s.wait_stream(torch.cuda.default_stream(cuda))  # NEW!
with torch.cuda.stream(s):
    B = torch.sum(A)
A.record_stream(s)  # NEW!

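There are two new additions. The torch.cuda.Stream.wait_stream() call ensures that the normal_() execution has finished before sum(A) starts running on the side stream. The torch.Tensor.record_stream() call ensures that A is not deallocated before sum(A) has completed. You can also manually wait on the stream at some later point with torch.cuda.default_stream(cuda).wait_stream(s). (Note that it is pointless to wait immediately, since that prevents the stream from running in parallel with other work on the default stream.) See the documentation for torch.Tensor.record_stream() for more details on when to use one or the other.

Note that this synchronization is necessary even when there is no read dependency, as in this example: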

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda)
s.wait_stream(torch.cuda.default_stream(cuda))  # STILL REQUIRED!
with torch.cuda.stream(s):
    A.normal_(0.0, 1.0)
    A.record_stream(s)

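Even though the computation on s does not read the contents of A and there are no other uses of A, synchronization is still necessary, because A may correspond to memory reallocated by the CUDA caching allocator, with pending operations from the old (deallocated) memory.

Stream semantics of backward passes

Each backward CUDA op runs on the same stream that was used for its corresponding forward op. If your forward pass runs independent ops in parallel on different streams, this helps the backward pass exploit the same parallelism.

The stream semantics of a backward call with respect to surrounding ops are the same as for any other call. The backward pass inserts internal syncs to ensure this even when backward ops run on multiple streams, as described in the previous paragraph. More concretely, when calling autograd.backward, autograd.grad, or tensor.backward, and optionally supplying CUDA tensor(s) as the initial gradient(s) (e.g., autograd.backward(..., grad_tensors=initial_grads), autograd.grad(..., grad_outputs=initial_grads), or tensor.backward(..., gradient=initial_grad)), the acts of

optionally populating initial gradient(s),
invoking the backward pass, and
using the gradients

have the same stream-semantics relationship as any group of ops: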

s = torch.cuda.Stream()

# Safe, grads are used in the same stream context as backward()
with torch.cuda.stream(s):
    loss.backward()
    use grads

# Unsafe
with torch.cuda.stream(s):
    loss.backward()
use grads

# Safe, with synchronization
with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

# Safe, populating initial grad and invoking backward are in the same stream context
with torch.cuda.stream(s):
    loss.backward(gradient=torch.ones_like(loss))

# Unsafe, populating initial_grad and invoking backward are in different stream contexts,
# without synchronization
initial_grad = torch.ones_like(loss)
with torch.cuda.stream(s):
    loss.backward(gradient=initial_grad)

# Safe, with synchronization
initial_grad = torch.ones_like(loss)
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    initial_grad.record_stream(s)
    loss.backward(gradient=initial_grad)

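BC note: Using grads on the default stream

In prior versions of PyTorch (1.9 and earlier), the autograd engine always synced the default stream with all backward ops, so the following pattern: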

with torch.cuda.stream(s):
    loss.backward()
use grads

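was safe as long as use grads happened on the default stream. In present PyTorch, that pattern is no longer safe. If backward() and use grads are in different stream contexts, you must sync the streams: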

with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

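even if use grads is on the default stream.

Memory management

PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, unused memory managed by the allocator still shows as used in nvidia-smi. You can use memory_allocated() and max_memory_allocated() to monitor the memory occupied by tensors, and memory_reserved() and max_memory_reserved() to monitor the total amount of memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that it can be used by other GPU applications. However, GPU memory occupied by tensors is not freed, so this cannot increase the amount of GPU memory available to PyTorch.

To better understand how CUDA memory is used over time, Understanding CUDA Memory Usage describes tools for capturing and visualizing traces of memory use.

For more advanced users, we offer more comprehensive memory benchmarking via memory_stats(). We also offer the ability to capture a complete snapshot of the memory allocator state via memory_snapshot(), which can help you understand the underlying allocation patterns produced by your code.

Optimizing memory usage with PYTORCH_CUDA_ALLOC_CONF

Use of a caching allocator can interfere with memory checking tools such as cuda-memcheck. To debug memory errors using cuda-memcheck, set PYTORCH_NO_CUDA_MEMORY_CACHING=1 in your environment to disable caching.

The behavior of the caching allocator can be controlled via the environment variable PYTORCH_CUDA_ALLOC_CONF. The format is PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>... Available options: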

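backend allows selecting the underlying allocator implementation. Currently, valid options are native, which uses PyTorch's native implementation, and cudaMallocAsync, which uses CUDA's built-in asynchronous allocator. cudaMallocAsync requires CUDA 11.4 or newer. The default is native. backend applies to all devices used by the process and can't be specified on a per-device basis.
max_split_size_mb prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. The performance cost can range from "zero" to "substantial" depending on allocation patterns. The default value is unlimited, i.e. all blocks can be split. The memory_stats() and memory_summary() methods are useful for tuning. This option should be used as a last resort for a workload that is aborting due to "out of memory" and showing a large number of inactive split blocks. max_split_size_mb is only meaningful with backend:native. With backend:cudaMallocAsync, max_split_size_mb is ignored.
roundup_power2_divisions helps round the requested allocation size up to the nearest power-2 division and make better use of the blocks. In the native CUDACachingAllocator, sizes are rounded up to a multiple of the 512-byte block size, which works fine for smaller sizes. However, this can be inefficient for large allocations of similar size, because each one lands in a differently sized block and reuse of those blocks is minimal. This can create many unused blocks and waste GPU memory capacity. This option enables rounding the allocation size up to the nearest power-2 division. For example, to round up a size of 1200 with 4 divisions: 1200 lies between 1024 and 2048, and 4 divisions between them give the values 1024, 1280, 1536, and 1792. So an allocation size of 1200 is rounded up to 1280, the nearest power-2 division at or above it. Specify a single value to apply to all allocation sizes, or specify an array of key-value pairs to set the power-2 division individually for each power-of-two interval. For example, to set 1 division for all allocations under 256MB, 2 divisions for allocations between 256MB and 512MB, 4 divisions for allocations between 512MB and 1GB, and 8 divisions for any larger allocations, set the knob value to: [256:1,512:2,1024:4,>:8]. roundup_power2_divisions is only meaningful with backend:native. With backend:cudaMallocAsync, roundup_power2_divisions is ignored.
max_non_split_rounding_mb allows non-split blocks to be reused more readily, e.g. a 1024MB cached block can be reused for a 512MB allocation request. In the default case, only up to 20MB of rounding is allowed for non-split blocks, so a 512MB request can only be served by a block of 512-532MB. If this option is set to 1024, blocks of 512-1536MB can be used to serve a 512MB request, which increases reuse of larger blocks. This also helps reduce stalls by avoiding expensive cudaMalloc calls.

garbage_collection_threshold helps actively reclaim unused GPU memory to avoid triggering the expensive sync-and-reclaim-all operation (release_cached_blocks), which can be unfavorable to latency-critical GPU applications (e.g., servers). With this threshold set (e.g., 0.8), the allocator starts reclaiming GPU memory blocks once GPU memory usage exceeds the threshold (i.e., 80% of the total memory allocated to the GPU application). The algorithm prefers to free old and unused blocks first, to avoid freeing blocks that are actively being reused. The threshold value should be greater than 0.0 and less than 1.0. garbage_collection_threshold is only meaningful with backend:native. With backend:cudaMallocAsync, garbage_collection_threshold is ignored.

expandable_segments (experimental, default: False) If set to True, this setting instructs the allocator to create CUDA allocations that can later be expanded, to better handle cases where a job changes allocation sizes frequently, such as a changing batch size. Normally, for large (>2MB) allocations, the allocator calls cudaMalloc to get allocations of the same size as the user requests. Later, parts of these allocations can be reused for other requests if they are free. This works well when the program makes many requests of exactly the same size or of sizes that are even multiples of that size. Many deep learning models follow this behavior. However, one common exception is when the batch size changes slightly from one iteration to the next, e.g. in batched inference. When the program first runs with batch size N, it makes allocations appropriate for that size. If it later runs at size N - 1, the existing allocations are still big enough. However, if it runs at size N + 1, it has to make new allocations that are slightly larger. Not all tensors are the same size. Some might be (N + 1)*A and others (N + 1)*A*B, where A and B are some non-batch dimensions in the model. Because the allocator reuses existing allocations when they are big enough, some of the (N + 1)*A allocations will actually fit in the already existing N*B*A segments, though not perfectly. As the model runs, it partially fills all of these segments, leaving unusable free slices of memory at their ends. At some point the allocator needs to cudaMalloc a new (N + 1)*A*B segment. If there is not enough memory, there is now no way to recover the slices of memory that are free at the end of existing segments. With models 50+ layers deep, this pattern might repeat 50+ times, creating many slivers.

expandable_segments allows the allocator to create a segment initially and then expand its size later when more memory is needed. Instead of making one segment per allocation, it tries to make one segment (per stream) that grows as necessary. Now when the N + 1 case runs, the allocations tile nicely into the one large segment until it fills up. Then more memory is requested and appended to the end of the segment. This process does not create as many slivers of unusable memory, so it is more likely to succeed at finding this memory.

pinned_use_cuda_host_register is a boolean flag that determines whether to use the CUDA API's cudaHostRegister function for allocating pinned memory instead of the default cudaHostAlloc. When set to True, the memory is allocated using regular malloc, and pages are then mapped to the memory before calling cudaHostRegister. This pre-mapping of pages helps reduce the lock time during the execution of cudaHostRegister.

pinned_num_register_threads is only valid when pinned_use_cuda_host_register is set to True. By default, one thread is used to map the pages. This option allows more threads to be used to parallelize the page-mapping operations and reduce the overall allocation time of pinned memory. Based on benchmarking results, a good value for this option is 8.

pinned_use_background_threads is a boolean flag that enables a background thread for processing events. This avoids any slow path associated with querying or processing events in the fast allocation path. This feature is disabled by default.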
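The rounding rule described for roundup_power2_divisions can be sketched in a few lines. This is an illustration of the description above, not PyTorch's actual allocator code; the function name is hypothetical and the sketch assumes divisions is a power of two.

```python
def roundup_power2_division(size: int, divisions: int) -> int:
    """Round size up to the next of `divisions` equal steps between powers of two.

    Illustrative sketch; assumes size > 0 and divisions is a power of two.
    """
    if size <= 0 or divisions < 1:
        raise ValueError("size must be positive and divisions at least 1")
    # Largest power of two that does not exceed size.
    floor = 1 << (size.bit_length() - 1)
    if size == floor:
        return size
    # Width of one division between floor and 2 * floor.
    step = floor // divisions
    if step == 0:
        return floor * 2
    # Ceiling of size to a multiple of step.
    return -(-size // step) * step


print(roundup_power2_division(1200, 4))
```

With 4 divisions, 1200 falls between 1024 and 2048 and is rounded up to 1280, which matches the worked example in the option description.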

Note

Some stats reported by the CUDA memory management API are specific to backend:native and are not meaningful with backend:cudaMallocAsync. See each function's docstring for details.
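As a small illustration of the <option>:<value>,<option2>:<value2> format, the sketch below assembles a configuration string and places it in the environment. The helper name build_alloc_conf and the chosen values (128 and 0.8) are assumptions for illustration only. The variable should be in place before the process makes its first CUDA allocation; setting it before importing torch is the conservative choice.

```python
import os


def build_alloc_conf(options: dict) -> str:
    """Join option/value pairs into the PYTORCH_CUDA_ALLOC_CONF format."""
    return ",".join(f"{name}:{value}" for name, value in options.items())


# Illustrative values only; tune them for your own workload.
conf = build_alloc_conf({
    "max_split_size_mb": 128,
    "garbage_collection_threshold": 0.8,
})

# Export the setting for this process before CUDA memory is first used.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf
print(conf)
```

The same string can equally be exported in the shell that launches the program.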

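Using custom memory allocators for CUDA

Allocators can be defined as simple functions in C/C++ and compiled as a shared library. The code below shows a basic allocator that just traces all memory operations.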

#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>
// Compile with g++ alloc.cc -o alloc.so -I/usr/local/cuda/include -shared -fPIC
extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   cudaMalloc(&ptr, size);
   std::cout<<"alloc "<<ptr<<" "<<size<<std::endl;
   return ptr;
}

void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
   std::cout<<"free "<<ptr<< " "<<stream<<std::endl;
   cudaFree(ptr);
}
}

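This can be used in Python through torch.cuda.memory.CUDAPluggableAllocator. The user is responsible for supplying the path to the .so file and the names of the alloc/free functions that match the signatures specified above.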

import torch

# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    'alloc.so', 'my_malloc', 'my_free')
# Swap the current allocator
torch.cuda.memory.change_current_allocator(new_alloc)
# This will allocate memory in the device using the new allocator
b = torch.zeros(10, device='cuda')

import torch

# Do an initial memory allocator
b = torch.zeros(10, device='cuda')
# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    'alloc.so', 'my_malloc', 'my_free')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(new_alloc)

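Mixing different CUDA system allocators in the same program

Depending on your use case, change_current_allocator() may not be what you want, since it swaps the CUDA allocator for the entire program (similar to PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync). For instance, if the swapped-in allocator has no caching mechanism, you lose all the benefits of PyTorch's CUDACachingAllocator. Instead, you can selectively mark a region of PyTorch code to use a custom allocator with torch.cuda.MemPool. This lets you use multiple CUDA system allocators in the same PyTorch program, along with most of the benefits of the CUDACachingAllocator (e.g. caching). Using torch.cuda.MemPool, you can use custom allocators that enable several features, such as:

Allocating output buffers for an all-reduce with the ncclMemAlloc allocator can enable NVLink Switch Reductions (NVLS). This can reduce contention between overlapping compute and communication kernels on GPU resources (SMs and copy engines), especially on tensor-parallel workloads.
For Grace CPU based systems, allocating host output buffers for an all-gather with cuMemCreate and specifying CU_MEM_LOCATION_TYPE_HOST_NUMA can enable Extended GPU Memory (EGM) based memory transfers from source GPUs to the destination CPU. This accelerates the all-gather because the transfer happens over NVLinks, where it would otherwise have happened over bandwidth-limited Network Interface Card (NIC) links. Such an accelerated all-gather can in turn speed up model checkpointing.
If you are crafting a model and don't want to think at first about the optimal memory placement of a memory-intensive module (e.g. an embedding table), or perhaps you have a module that is not performance sensitive and doesn't fit on the GPU, you can allocate that module with cudaMallocManaged with a preferred CPU location and get your model working first.

Note

While cudaMallocManaged offers convenient automatic memory management using CUDA Unified Virtual Memory (UVM), it is not recommended for deep learning (DL) workloads. For DL workloads that fit in GPU memory, explicit placement consistently outperforms UVM, since there are no page faults and access patterns remain predictable. When GPU memory becomes saturated, UVM has to perform costly double transfers, evicting pages to the CPU before bringing in new ones.

The code below shows ncclMemAlloc wrapped in a torch.cuda.memory.CUDAPluggableAllocator.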

import os

import torch
import torch.distributed as dist
from torch.cuda.memory import CUDAPluggableAllocator
from torch.distributed.distributed_c10d import _get_default_group
from torch.utils import cpp_extension


# create allocator
nccl_allocator_source = """
#include <nccl.h>
#include <iostream>
extern "C" {

void* nccl_alloc_plug(size_t size, int device, void* stream) {
  std::cout << "Using ncclMemAlloc" << std::endl;
  void* ptr;
  ncclResult_t err = ncclMemAlloc(&ptr, size);
  return ptr;

}

void nccl_free_plug(void* ptr, size_t size, int device, void* stream) {
  std::cout << "Using ncclMemFree" << std::endl;
  ncclResult_t err = ncclMemFree(ptr);
}

}
"""
nccl_allocator_libname = "nccl_allocator"
nccl_allocator = torch.utils.cpp_extension.load_inline(
    name=nccl_allocator_libname,
    cpp_sources=nccl_allocator_source,
    with_cuda=True,
    extra_ldflags=["-lnccl"],
    verbose=True,
    is_python_module=False,
    build_directory="./",
)

allocator = CUDAPluggableAllocator(
    f"./{nccl_allocator_libname}.so", "nccl_alloc_plug", "nccl_free_plug"
).allocator()

# setup distributed
rank = int(os.getenv("RANK"))
local_rank = int(os.getenv("LOCAL_RANK"))
world_size = int(os.getenv("WORLD_SIZE"))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
device = torch.device(f"cuda:{local_rank}")
default_pg = _get_default_group()
backend = default_pg._get_backend(device)

# Note: for convenience, ProcessGroupNCCL backend provides
# the ncclMemAlloc allocator as backend.mem_allocator
allocator = backend.mem_allocator

You can now define a new memory pool by passing this allocator to torch.cuda.MemPool:

pool = torch.cuda.MemPool(allocator)

The pool can then be used with the torch.cuda.use_mem_pool context manager to allocate tensors:

with torch.cuda.use_mem_pool(pool):
    # tensor gets allocated with ncclMemAlloc passed in the pool
    tensor = torch.arange(1024 * 1024 * 2, device=device)
    print(f"tensor ptr on rank {rank} is {hex(tensor.data_ptr())}")

# register user buffers using ncclCommRegister (called under the hood)
backend.register_mem_pool(pool)

# Collective uses Zero Copy NVLS
dist.all_reduce(tensor[0:4])
torch.cuda.synchronize()
print(tensor[0:4])

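Note the use of register_mem_pool in the example above. This is an extra step for NVLS reductions, where the user buffers need to be registered with NCCL. A user can de-register the buffers with a similar deregister_mem_pool call.

To reclaim memory, users first need to ensure that nothing is using the pool. When no tensors hold a reference to the pool, empty_cache() is called internally on deletion of the pool, returning all the memory to the system.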

del tensor, pool

The following torch.cuda.MemPool.use_count() and torch.cuda.MemPool.snapshot() APIs can be used for debugging purposes:

pool = torch.cuda.MemPool(allocator)

# pool's use count should be 1 at this point as MemPool object
# holds a reference
assert pool.use_count() == 1

nelem_1mb = 1024 * 1024 // 4

with torch.cuda.use_mem_pool(pool):
    out_0 = torch.randn(nelem_1mb, device="cuda")

    # pool's use count should be 2 at this point as use_mem_pool
    # holds a reference
    assert pool.use_count() == 2

# pool's use count should be back to 1 at this point as use_mem_pool
# released its reference
assert pool.use_count() == 1

with torch.cuda.use_mem_pool(pool):
    # pool should have 1 segment since we made a small allocation (1 MB)
    # above and so the CUDACachingAllocator packed it into a 2 MB buffer
    assert len(pool.snapshot()) == 1

    out_1 = torch.randn(nelem_1mb, device="cuda")

    # pool should still have 1 segment since we made another small allocation
    # (1 MB) that got packed into the existing 2 MB buffer
    assert len(pool.snapshot()) == 1

    out_2 = torch.randn(nelem_1mb, device="cuda")

    # pool now should have 2 segments since the CUDACachingAllocator had
    # to make a new 2 MB buffer to accommodate out_2
    assert len(pool.snapshot()) == 2

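Note

torch.cuda.MemPool holds a reference to the pool. When you use the torch.cuda.use_mem_pool context manager, it acquires another reference to the pool. On exit, the context manager releases its reference. After that, ideally only tensors hold references to the pool. Once the tensors release their references, the use count of the pool is 1, reflecting that only the torch.cuda.MemPool object holds a reference. Only at that point can the memory held by the pool be returned to the system, when the pool's destructor is called using del.
torch.cuda.MemPool doesn't currently support the expandable_segments mode of CUDACachingAllocator.
NCCL has specific requirements for a buffer to be compatible with NVLS reductions. These requirements can be broken in a dynamic workload; for instance, the buffer sent to NCCL by the CUDACachingAllocator might be split and hence not correctly aligned. In those cases, NCCL can use a fallback algorithm instead of NVLS.
Allocators like ncclMemAlloc can use more memory than requested because of alignment requirements (CU_MULTICAST_GRANULARITY_RECOMMENDED, CU_MULTICAST_GRANULARITY_MINIMUM), which can cause your workload to run out of memory.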

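cuBLAS workspaces

For each combination of cuBLAS handle and CUDA stream, a cuBLAS workspace is allocated if that handle and stream combination executes a cuBLAS kernel that requires a workspace. To avoid repeatedly allocating workspaces, these workspaces are not deallocated unless torch._C._cuda_clearCublasWorkspaces() is called. The workspace size per allocation can be specified via the environment variable CUBLAS_WORKSPACE_CONFIG with the format :[SIZE]:[COUNT]. As an example, the default workspace size per allocation is CUBLAS_WORKSPACE_CONFIG=:4096:2:16:8, which specifies a total size of 2 * 4096 + 8 * 16 KiB. To force cuBLAS to avoid using workspaces, set CUBLAS_WORKSPACE_CONFIG=:0:0.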
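The total in the example above follows directly from the :[SIZE]:[COUNT] pairs. The sketch below parses such a string and adds up the sizes; the function name is hypothetical, and the parsing rules are inferred from the format described here rather than taken from cuBLAS itself.

```python
def cublas_workspace_total_kib(config: str) -> int:
    """Sum SIZE * COUNT over every :[SIZE]:[COUNT] pair, in KiB.

    Illustrative sketch of the format described above.
    """
    fields = [int(field) for field in config.split(":") if field]
    if len(fields) % 2 != 0:
        raise ValueError("expected an even number of :[SIZE]:[COUNT] fields")
    sizes, counts = fields[0::2], fields[1::2]
    return sum(size * count for size, count in zip(sizes, counts))


# Default configuration quoted above: 2 * 4096 + 8 * 16 KiB.
print(cublas_workspace_total_kib(":4096:2:16:8"))
```

For the default string this gives 8320 KiB per allocation, and :0:0 gives 0, which corresponds to using no workspace.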

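cuFFT plan cache

For each CUDA device, an LRU cache of cuFFT plans is used to speed up repeatedly running FFT methods (e.g., torch.fft.fft()) on CUDA tensors of the same geometry with the same configuration. Because some cuFFT plans may allocate GPU memory, these caches have a maximum capacity.

You can control and query the properties of the cache of the current device with the following APIs:

torch.backends.cuda.cufft_plan_cache.max_size gives the capacity of the cache (default is 4096 on CUDA 10 and newer, and 1023 on older CUDA versions). Setting this value directly modifies the capacity.
torch.backends.cuda.cufft_plan_cache.size gives the number of plans currently residing in the cache.
torch.backends.cuda.cufft_plan_cache.clear() clears the cache.

To control and query the plan cache of a non-default device, index the torch.backends.cuda.cufft_plan_cache object with either a torch.device object or a device index, and access one of the attributes above. For example, to set the capacity of the cache for device 1, write torch.backends.cuda.cufft_plan_cache[1].max_size = 10.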

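Just-in-Time Compilation

PyTorch just-in-time compiles some operations, like torch.special.zeta, when they are performed on CUDA tensors. This compilation can be time consuming (up to a few seconds depending on your hardware and software) and may occur multiple times for a single operator, since many PyTorch operators actually select from a variety of kernels, each of which must be compiled once, depending on their input. This compilation occurs once per process, or just once if a kernel cache is used.

By default, PyTorch creates a kernel cache in $XDG_CACHE_HOME/torch/kernels if XDG_CACHE_HOME is defined and in $HOME/.cache/torch/kernels if it is not (except on Windows, where the kernel cache is not yet supported). The caching behavior can be controlled directly with two environment variables. If USE_PYTORCH_KERNEL_CACHE is set to 0, no cache is used, and if PYTORCH_KERNEL_CACHE_PATH is set, that path is used as the kernel cache instead of the default location.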
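The lookup order described above can be written out as a small function. This is a sketch of the documented rules, not PyTorch's implementation; the function name is hypothetical, and giving USE_PYTORCH_KERNEL_CACHE=0 priority over PYTORCH_KERNEL_CACHE_PATH is an assumption, since the text does not say which wins.

```python
import posixpath
from typing import Mapping, Optional


def kernel_cache_dir(env: Mapping[str, str]) -> Optional[str]:
    """Return the kernel cache directory implied by env, or None if disabled.

    Illustrative sketch for non-Windows systems.
    """
    # Assumption: an explicit opt-out takes priority over any path setting.
    if env.get("USE_PYTORCH_KERNEL_CACHE") == "0":
        return None
    override = env.get("PYTORCH_KERNEL_CACHE_PATH")
    if override:
        return override
    xdg = env.get("XDG_CACHE_HOME")
    if xdg:
        return posixpath.join(xdg, "torch", "kernels")
    return posixpath.join(env.get("HOME", ""), ".cache", "torch", "kernels")


print(kernel_cache_dir({"HOME": "/home/user"}))
print(kernel_cache_dir({"HOME": "/home/user", "XDG_CACHE_HOME": "/tmp/cache"}))
```

Passing a plain dictionary instead of os.environ keeps the sketch easy to try with different settings.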

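Best practices

Device-agnostic code

Due to the structure of PyTorch, you may need to explicitly write device-agnostic (CPU or GPU) code; an example is creating a new tensor as the initial hidden state of a recurrent neural network.

The first step is to determine whether the GPU should be used. A common pattern is to use Python's argparse module to read in user arguments and have a flag that can be used to disable CUDA, in combination with is_available(). In the following, args.device results in a torch.device object that can be used to move tensors to CPU or CUDA.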

import argparse
import torch

parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true',
                    help='Disable CUDA')
args = parser.parse_args()
args.device = None
if not args.disable_cuda and torch.cuda.is_available():
    args.device = torch.device('cuda')
else:
    args.device = torch.device('cpu')

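Note

When assessing the availability of CUDA in a given environment (is_available()), PyTorch's default behavior is to call the CUDA Runtime API method cudaGetDeviceCount. Because this call in turn initializes the CUDA Driver API (via cuInit) if it is not already initialized, subsequent forks of a process that has run is_available() will fail with a CUDA initialization error.

You can set PYTORCH_NVML_BASED_CUDA_CHECK=1 in your environment before importing PyTorch modules that execute is_available() (or before executing it directly) to direct is_available() to attempt an NVML-based assessment (nvmlDeviceGetCount_v2). If the NVML-based assessment succeeds (i.e. NVML discovery/initialization does not fail), is_available() calls will not poison subsequent forks.

If NVML discovery/initialization fails, is_available() falls back to the standard CUDA Runtime API assessment, and the fork constraint described above applies.

Note that the NVML-based CUDA availability assessment provides a weaker guarantee than the default CUDA Runtime API approach (which requires CUDA initialization to succeed). In some circumstances, the NVML-based check may succeed while later CUDA initialization fails.

Now that we have args.device, we can use it to create a Tensor on the desired device.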

x = torch.empty((8, 42), device=args.device)
net = Network().to(device=args.device)

This can be used in a number of cases to produce device-agnostic code. Below is an example when using a dataloader:

cuda0 = torch.device('cuda:0')  # CUDA GPU 0
for i, x in enumerate(train_loader):
    x = x.to(cuda0)

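When working with multiple GPUs on a system, you can use the CUDA_VISIBLE_DEVICES environment flag to manage which GPUs are available to PyTorch. As mentioned above, to manually control which GPU a tensor is created on, the best practice is to use a torch.cuda.device context manager.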

print("Outside device is 0")  # On device 0 (default in most scenarios)
with torch.cuda.device(1):
    print("Inside device is 1")  # On device 1
print("Outside device is still 0")  # On device 0
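One detail that often causes confusion with CUDA_VISIBLE_DEVICES is that the visible GPUs are renumbered from zero inside the process. The sketch below shows the resulting mapping; the function name is hypothetical, and it only handles integer device indices (CUDA also accepts device UUIDs, which this sketch does not cover).

```python
def logical_to_physical(visible_devices: str) -> dict:
    """Map in-process device indices to the physical indices that were listed.

    Illustrative sketch; assumes a comma-separated list of integer indices.
    """
    physical = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    # Inside the process, devices are numbered 0..n-1 in the listed order.
    return dict(enumerate(physical))


# With CUDA_VISIBLE_DEVICES=2,3 the process sees two GPUs, cuda:0 and cuda:1.
print(logical_to_physical("2,3"))
```

So torch.device('cuda:0') in such a process refers to physical GPU 2, and torch.cuda.device(1) selects physical GPU 3.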

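If you have a tensor and would like to create a new tensor of the same type on the same device, you can use a torch.Tensor.new_* method (see torch.Tensor). While the previously mentioned torch.* factory functions (creation ops) depend on the current GPU context and the attributes you pass in, torch.Tensor.new_* methods preserve the device and other attributes of the tensor.

This is the recommended practice when creating modules in which new tensors need to be created internally during the forward pass.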

cuda = torch.device('cuda')
x_cpu = torch.empty(2)
x_gpu = torch.empty(2, device=cuda)
x_cpu_long = torch.empty(2, dtype=torch.int64)

y_cpu = x_cpu.new_full([3, 2], fill_value=0.3)
print(y_cpu)

    tensor([[ 0.3000,  0.3000],
            [ 0.3000,  0.3000],
            [ 0.3000,  0.3000]])

y_gpu = x_gpu.new_full([3, 2], fill_value=-5)
print(y_gpu)

    tensor([[-5.0000, -5.0000],
            [-5.0000, -5.0000],
            [-5.0000, -5.0000]], device='cuda:0')

y_cpu_long = x_cpu_long.new_tensor([[1, 2, 3]])
print(y_cpu_long)

    tensor([[ 1,  2,  3]])

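If you want to create a tensor of the same type and size as another tensor, and fill it with either ones or zeros, ones_like() and zeros_like() are provided as convenient helper functions (which also preserve the torch.device and torch.dtype of a Tensor).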

x_cpu = torch.empty(2, 3)
x_gpu = torch.empty(2, 3, device='cuda')

y_cpu = torch.ones_like(x_cpu)
y_gpu = torch.zeros_like(x_gpu)

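Use pinned memory buffers

Warning

This is an advanced tip. If you overuse pinned memory, it can cause serious problems when running low on RAM, and you should be aware that pinning is often an expensive operation.

Host to GPU copies are much faster when they originate from pinned (page-locked) memory. CPU tensors and storages expose a pin_memory() method, which returns a copy of the object with its data placed in a pinned region.

Also, once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a to() or cuda() call. This can be used to overlap data transfers with computation.

You can make the DataLoader return batches placed in pinned memory by passing pin_memory=True to its constructor.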
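The pinning and asynchronous copy steps described above fit together roughly as follows. This is a minimal sketch that assumes PyTorch is installed; it falls back to the CPU when no GPU is available, in which case the pinning step is skipped and the copy is a no-op. The tensor shape is arbitrary.

```python
import torch

# Use the GPU when present; otherwise stay on the CPU so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch = torch.randn(256, 128)
if device.type == "cuda":
    # Copy the batch into page-locked host memory so the transfer can be async.
    batch = batch.pin_memory()

# With a pinned source, this copy can overlap with other host-side work.
on_device = batch.to(device, non_blocking=True)

# Work queued on the same stream afterwards is ordered after the copy.
result = on_device.sum()
print(result.device)
```

Pinning every batch by hand is rarely needed in practice, since a DataLoader constructed with pin_memory=True already returns pinned batches.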

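Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel

Most use cases involving batched inputs and multiple GPUs should default to using DistributedDataParallel to utilize more than one GPU.

There are significant caveats to using CUDA models with multiprocessing; unless care is taken to meet the data handling requirements exactly, it is likely that your program will have incorrect or undefined behavior.

It is recommended to use DistributedDataParallel instead of DataParallel for multi-GPU training, even if there is only a single node.

The difference between DistributedDataParallel and DataParallel is that DistributedDataParallel uses multiprocessing, where a process is created for each GPU, while DataParallel uses multithreading. With multiprocessing, each GPU has its own dedicated process, which avoids the performance overhead caused by the GIL of the Python interpreter.

If you use DistributedDataParallel, you can use the torch.distributed.launch utility to launch your program; see Third-party backends.

CUDA Graphs

A CUDA graph is a record of the work (mostly kernels and their arguments) that a CUDA stream and its dependent streams perform. For general principles and details on the underlying CUDA API, see Getting Started with CUDA Graphs and the Graphs section of the CUDA C Programming Guide.

PyTorch supports the construction of CUDA graphs using stream capture, which puts a CUDA stream in capture mode. CUDA work issued to a capturing stream doesn't actually run on the GPU. Instead, the work is recorded in a graph.

After capture, the graph can be launched to run the GPU work as many times as needed. Each replay runs the same kernels with the same arguments. For pointer arguments, this means the same memory addresses are used. By filling input memory with new data (e.g., from a new batch) before each replay, you can rerun the same work on new data.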

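Why CUDA Graphs?

Replaying a graph sacrifices the dynamic flexibility of typical eager execution in exchange for greatly reduced CPU overhead. A graph's arguments and kernels are fixed, so a graph replay skips all layers of argument setup and kernel dispatch, including Python, C++, and CUDA driver overheads. Under the hood, a replay submits the entire graph's work to the GPU with a single call to cudaGraphLaunch. Kernels in a replay also execute slightly faster on the GPU, but eliding CPU overhead is the main benefit.

You should try CUDA graphs if all or part of your network is graph-safe (usually this means static shapes and static control flow, but see the other constraints) and you suspect its runtime is at least somewhat CPU-limited.

PyTorch API

Warning

This API is in beta and may change in future releases.

PyTorch exposes graphs via a raw torch.cuda.CUDAGraph class and two convenience wrappers, torch.cuda.graph and torch.cuda.make_graphed_callables.

torch.cuda.graph is a simple, versatile context manager that captures CUDA work in its context. Before capture, warm up the workload to be captured by running a few eager iterations. Warmup must occur on a side stream. Because the graph reads from and writes to the same memory addresses in every replay, you must maintain long-lived references to the tensors that hold input and output data during capture. To run the graph on new input data, copy the new data to the capture's input tensor(s), replay the graph, then read the new output from the capture's output tensor(s). Example: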

g = torch.cuda.CUDAGraph()

# Placeholder input used for capture
static_input = torch.empty((5,), device="cuda")

# Warmup before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = static_input * 2
torch.cuda.current_stream().wait_stream(s)

# Captures the graph
# To allow capture, automatically sets a side stream as the current stream in the context
with torch.cuda.graph(g):
    static_output = static_input * 2

# Fills the graph's input memory with new data to compute on
static_input.copy_(torch.full((5,), 3, device="cuda"))
g.replay()
# static_output holds the results
print(static_output)  # full of 3 * 2 = 6

# Fills the graph's input memory with more data to compute on
static_input.copy_(torch.full((5,), 4, device="cuda"))
g.replay()
print(static_output)  # full of 4 * 2 = 8

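See Whole-network capture, Usage with torch.cuda.amp, and Usage with multiple streams for realistic and advanced patterns.

make_graphed_callables is more sophisticated. make_graphed_callables accepts Python functions and torch.nn.Modules. For each passed function or Module, it creates separate graphs of the forward-pass and backward-pass work. See Partial-network capture.

Constraints

A set of ops is capturable if it doesn't violate any of the following constraints.

Constraints apply to all work in a torch.cuda.graph context and to all work in the forward and backward passes of any callable you pass to torch.cuda.make_graphed_callables().

Violating any of these will likely cause a runtime error:

Capture must occur on a non-default stream. (This is only a concern if you use the raw CUDAGraph.capture_begin and CUDAGraph.capture_end calls. graph and make_graphed_callables() set a side stream for you.)
Ops that synchronize the CPU with the GPU (e.g., .item() calls) are prohibited.
CUDA RNG operations are permitted. When multiple torch.Generator instances are used within a graph, they must be registered using CUDAGraph.register_generator_state before graph capture. Avoid Generator.get_state and Generator.set_state during capture; instead, use Generator.graphsafe_set_state and Generator.graphsafe_get_state to manage generator states safely within the graph context. This ensures proper RNG operation and generator management within CUDA graphs.

Violating any of these will likely cause silent numerical errors or undefined behavior:

Within a process, only one capture may be underway at a time.
No non-captured CUDA work may run in this process (on any thread) while capture is underway.
CPU work is not captured. If the captured ops include CPU work, that work will be elided during replay.
Every replay reads from and writes to the same (virtual) memory addresses.
Dynamic control flow (based on CPU or GPU data) is prohibited.
Dynamic shapes are prohibited. The graph assumes every tensor in the captured op sequence has the same size and layout in every replay.
Using multiple streams in a capture is allowed, but there are restrictions.

Non-constraints

Once captured, the graph may be replayed on any stream.

Whole-network capture

If your entire network is capturable, you can capture and replay an entire iteration: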

N, D_in, H, D_out = 640, 4096, 2048, 1024
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.Dropout(p=0.2),
                            torch.nn.Linear(H, D_out),
                            torch.nn.Dropout(p=0.1)).cuda()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Placeholders used for capture
static_input = torch.randn(N, D_in, device='cuda')
static_target = torch.randn(N, D_out, device='cuda')

# warmup
# Uses static_input and static_target here for convenience,
# but in a real setting, because the warmup includes optimizer.step()
# you must use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        y_pred = model(static_input)
        loss = loss_fn(y_pred, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)
    static_loss.backward()
    optimizer.step()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    # Fills the graph's input memory with new data to compute on
    static_input.copy_(data)
    static_target.copy_(target)
    # replay() includes forward, backward, and step.
    # You don't even need to call optimizer.zero_grad() between iterations
    # because the captured backward refills static .grad tensors in place.
    g.replay()
    # Params have been updated. static_y_pred, static_loss, and .grad
    # attributes hold values from computing on this iteration's data.

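Partial-network capture

If some of your network is unsafe to capture (e.g., due to dynamic control flow, dynamic shapes, CPU syncs, or essential CPU-side logic), you can run the unsafe part(s) eagerly and use torch.cuda.make_graphed_callables() to graph only the capture-safe part(s).

By default, callables returned by make_graphed_callables() are autograd-aware and can be used in the training loop as direct replacements for the functions or nn.Modules you passed.

make_graphed_callables() internally creates CUDAGraph objects, runs warmup iterations, and maintains static inputs and outputs as needed. Therefore (unlike with torch.cuda.graph) you don't need to handle those manually.

In the following example, data-dependent dynamic control flow means the network isn't capturable end-to-end, but make_graphed_callables() lets us capture and run the graph-safe sections as graphs regardless: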

from itertools import chain  # used below to combine parameter iterators

N, D_in, H, D_out = 640, 4096, 2048, 1024

module1 = torch.nn.Linear(D_in, H).cuda()
module2 = torch.nn.Linear(H, D_out).cuda()
module3 = torch.nn.Linear(H, D_out).cuda()

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(chain(module1.parameters(),
                                  module2.parameters(),
                                  module3.parameters()),
                            lr=0.1)

# Sample inputs used for capture
# requires_grad state of sample inputs must match
# requires_grad state of real inputs each callable will see.
x = torch.randn(N, D_in, device='cuda')
h = torch.randn(N, H, device='cuda', requires_grad=True)

module1 = torch.cuda.make_graphed_callables(module1, (x,))
module2 = torch.cuda.make_graphed_callables(module2, (h,))
module3 = torch.cuda.make_graphed_callables(module3, (h,))

real_inputs = [torch.rand_like(x) for _ in range(10)]
real_targets = [torch.randn(N, D_out, device="cuda") for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    optimizer.zero_grad(set_to_none=True)

    tmp = module1(data)  # forward ops run as a graph

    if tmp.sum().item() > 0:
        tmp = module2(tmp)  # forward ops run as a graph
    else:
        tmp = module3(tmp)  # forward ops run as a graph

    loss = loss_fn(tmp, target)
    # module2's or module3's (whichever was chosen) backward ops,
    # as well as module1's backward ops, run as graphs
    loss.backward()
    optimizer.step()

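Usage with torch.cuda.amp

For typical optimizers, GradScaler.step syncs the CPU with the GPU, which is prohibited during capture. To avoid errors, either use partial-network capture, or (if forward, loss, and backward are capture-safe) capture forward, loss, and backward but not the optimizer step: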

# warmup
# In a real setting, use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            y_pred = model(static_input)
            loss = loss_fn(y_pred, static_target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    with torch.cuda.amp.autocast():
        static_y_pred = model(static_input)
        static_loss = loss_fn(static_y_pred, static_target)
    scaler.scale(static_loss).backward()
    # don't capture scaler.step(optimizer) or scaler.update()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
    # Runs scaler.step and scaler.update eagerly
    scaler.step(optimizer)
    scaler.update()

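Usage with multiple streams

Capture mode automatically propagates to any streams that sync with a capturing stream. Within capture, you may expose parallelism by issuing calls to different streams, but the overall stream dependency DAG must branch out from the initial capturing stream after capture begins and rejoin the initial stream before capture ends: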

with torch.cuda.graph(g):
    # at context manager entrance, torch.cuda.current_stream()
    # is the initial capturing stream

    # INCORRECT (does not branch out from or rejoin initial stream)
    with torch.cuda.stream(s):
        cuda_work()

    # CORRECT:
    # branches out from initial stream
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        cuda_work()
    # rejoins initial stream before capture ends
    torch.cuda.current_stream().wait_stream(s)

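Note

To avoid confusion for power users looking at replays in nsight systems or nvprof: unlike eager execution, the graph interprets a nontrivial stream DAG in capture as a hint, not a command. During replay, the graph may reorganize independent ops onto different streams or enqueue them in a different order (while respecting your original DAG's overall dependencies).

Usage with DistributedDataParallel

NCCL < 2.9.6

NCCL versions earlier than 2.9.6 don't allow collectives to be captured. You must use partial-network capture, which defers allreduces to happen outside the graphed sections of backward.

Call make_graphed_callables() on the graphable network sections before wrapping the network with DDP.

NCCL >= 2.9.6

NCCL versions 2.9.6 or later allow collectives in the graph. Approaches that capture an entire backward pass are a viable option, but need three setup steps.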

Disable DDP's internal async error handling:

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"
torch.distributed.init_process_group(...)

Before full-backward capture, DDP must be constructed in a side-stream context:

with torch.cuda.stream(s):
    model = DistributedDataParallel(model)

Your warmup must run at least 11 DDP-enabled eager iterations before capture.
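The choice between the two NCCL cases above comes down to a version comparison. The sketch below assumes the NCCL version is available as a (major, minor, patch) tuple, for example from torch.cuda.nccl.version() in recent PyTorch releases; the function name and the returned labels are hypothetical.

```python
def ddp_capture_strategy(nccl_version: tuple) -> str:
    """Pick a capture approach for DDP based on the NCCL version.

    Illustrative sketch; expects a (major, minor, patch) tuple.
    """
    # Collectives can only be captured with NCCL 2.9.6 or later.
    if tuple(nccl_version) >= (2, 9, 6):
        return "full-backward capture (with the three setup steps)"
    return "partial-network capture"


print(ddp_capture_strategy((2, 9, 5)))
print(ddp_capture_strategy((2, 18, 3)))
```

Comparing tuples avoids the pitfalls of comparing version strings, where "2.10" would sort before "2.9".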

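Graph memory management

A captured graph acts on the same virtual addresses every time it replays. If PyTorch frees the memory, a later replay can hit an illegal memory access. If PyTorch reassigns the memory to new tensors, the replay can corrupt the values seen by those tensors. Therefore, the virtual addresses used by the graph must be reserved for the graph across replays. The PyTorch caching allocator achieves this by detecting when capture is underway and satisfying the capture's allocations from a graph-private memory pool. The private pool stays alive until its CUDAGraph object and all tensors created during capture go out of scope.

Private pools are maintained automatically. By default, the allocator creates a separate private pool for each capture. If you capture multiple graphs, this conservative approach ensures that graph replays never corrupt each other's values, but it sometimes needlessly wastes memory.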