Introduction to Distributed Pipeline Parallelism

Created On: Apr 01, 2025 | Last Updated: Apr 01, 2025 | Last Verified: Nov 05, 2024

Author: 黄浩然


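This tutorial uses a GPT-style transformer model to demonstrate how to implement distributed pipeline parallelism with the torch.distributed.pipelining API.

What you will learn

How to use the torch.distributed.pipelining API
How to apply pipeline parallelism to a transformer model
How to use different schedules on a set of microbatches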

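Prerequisites

Familiarity with basic distributed training in PyTorch

Setup

With torch.distributed.pipelining we will partition the execution of a model and schedule computation on microbatches. We will use a simplified transformer decoder model. The architecture is for educational purposes only and has multiple transformer decoder layers, because we want to show how to split the model into different chunks. First, let us define the model: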

import torch
import torch.nn as nn
from dataclasses import dataclass

@dataclass
class ModelArgs:
   dim: int = 512
   n_layers: int = 8
   n_heads: int = 8
   vocab_size: int = 10000

class Transformer(nn.Module):
   def __init__(self, model_args: ModelArgs):
      super().__init__()

      self.tok_embeddings = nn.Embedding(model_args.vocab_size, model_args.dim)

      # Using a ModuleDict lets us delete layers without affecting names,
      # ensuring checkpoints will correctly save and load.
      self.layers = torch.nn.ModuleDict()
      for layer_id in range(model_args.n_layers):
         self.layers[str(layer_id)] = nn.TransformerDecoderLayer(model_args.dim, model_args.n_heads)

      self.norm = nn.LayerNorm(model_args.dim)
      self.output = nn.Linear(model_args.dim, model_args.vocab_size)

   def forward(self, tokens: torch.Tensor):
      # Handling layers being 'None' at runtime enables easy pipeline splitting
      h = self.tok_embeddings(tokens) if self.tok_embeddings else tokens

      for layer in self.layers.values():
         h = layer(h, h)

      h = self.norm(h) if self.norm else h
      output = self.output(h).clone() if self.output else h
      return output

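Next, we need to import the necessary libraries in our script and initialize the distributed training process. Here we define some global variables that are used later in the script: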

import os
import torch.distributed as dist
from torch.distributed.pipelining import pipeline, SplitPoint, PipelineStage, ScheduleGPipe

global rank, device, pp_group, stage_index, num_stages
def init_distributed():
   global rank, device, pp_group, stage_index, num_stages
   rank = int(os.environ["LOCAL_RANK"])
   world_size = int(os.environ["WORLD_SIZE"])
   device = torch.device(f"cuda:{rank}") if torch.cuda.is_available() else torch.device("cpu")
   dist.init_process_group()

   # This group can be a sub-group in the N-D parallel case
   pp_group = dist.new_group()
   stage_index = rank
   num_stages = world_size

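The rank, world_size, and init_process_group() code should look familiar, as these are common to all distributed programs. The globals specific to pipeline parallelism are pp_group, the process group used for send/recv communications; stage_index, which in this example equals the rank because each rank hosts a single stage; and num_stages, which equals world_size.

num_stages sets the number of stages used in the pipeline parallel schedule. For example, with num_stages=4, a microbatch needs to go through 4 forwards and 4 backwards before it is complete. stage_index is needed so that the framework knows how to communicate between stages. For example, the first stage (stage_index=0) uses data from the dataloader and does not need to receive data from any previous peer to perform its computation.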
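To make the role of stage_index and num_stages concrete, here is a minimal sketch of how a stage's neighbors could be derived in the one-stage-per-rank layout used in this tutorial. The helpers stage_peers and passes_per_microbatch are hypothetical names introduced for illustration; they are not part of torch.distributed.pipelining, which works out its communication pattern internally.

```python
from typing import Optional, Tuple


def stage_peers(stage_index: int, num_stages: int) -> Tuple[Optional[int], Optional[int]]:
    """Return (previous_stage, next_stage) for a linear pipeline.

    Hypothetical helper for illustration only; assumes one stage per rank.
    """
    if not 0 <= stage_index < num_stages:
        raise ValueError("stage_index must be in [0, num_stages)")
    # The first stage reads from the dataloader, so it has no previous peer.
    prev_stage = stage_index - 1 if stage_index > 0 else None
    # The last stage produces the final output, so it has no next peer.
    next_stage = stage_index + 1 if stage_index < num_stages - 1 else None
    return prev_stage, next_stage


def passes_per_microbatch(num_stages: int) -> Tuple[int, int]:
    """Each microbatch runs one forward and one backward on every stage."""
    return num_stages, num_stages


if __name__ == "__main__":
    for idx in range(4):
        print(idx, stage_peers(idx, 4))
```

In this sketch, a previous peer of None corresponds to a stage that takes its input directly from the dataloader.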

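Step 1: Partition the Transformer Model

There are two different ways of partitioning the model:

The first is the manual mode, in which we create two instances of the model by deleting portions of the model's attributes. In this example, with two stages (2 ranks), the model is cut in half.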

def manual_model_split(model) -> PipelineStage:
   if stage_index == 0:
      # prepare the first stage model
      for i in range(4, 8):
         del model.layers[str(i)]
      model.norm = None
      model.output = None

   elif stage_index == 1:
      # prepare the second stage model
      for i in range(4):
         del model.layers[str(i)]
      model.tok_embeddings = None

   stage = PipelineStage(
      model,
      stage_index,
      num_stages,
      device,
   )
   return stage

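As we can see, the first stage has no layer normalization or output layer, and it includes only the first four transformer blocks. The second stage has no input embedding layer, but it includes the output layer and the final four transformer blocks. The function then returns the PipelineStage for the current rank.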
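The manual split hard-codes the layer ranges for two stages. As a rough sketch of how the same bookkeeping could be generalized, the hypothetical helper layers_for_stage below computes which layer indices a stage keeps under an even, contiguous split. It assumes n_layers is divisible by num_stages and is not part of the tutorial script.

```python
from typing import List


def layers_for_stage(n_layers: int, num_stages: int, stage_index: int) -> List[int]:
    """Return the layer indices kept by one stage under an even, contiguous split.

    Hypothetical helper; assumes n_layers is divisible by num_stages.
    """
    if n_layers % num_stages != 0:
        raise ValueError("this sketch assumes n_layers is divisible by num_stages")
    if not 0 <= stage_index < num_stages:
        raise ValueError("stage_index must be in [0, num_stages)")
    per_stage = n_layers // num_stages
    start = stage_index * per_stage
    return list(range(start, start + per_stage))


if __name__ == "__main__":
    for idx in range(2):
        print(idx, layers_for_stage(8, 2, idx))
```

With 8 layers and 2 stages, stage 0 keeps layers 0 to 3 and stage 1 keeps layers 4 to 7, which matches what manual_model_split deletes on each rank.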

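The second method is the tracer-based mode, which automatically splits the model based on a split_spec argument. Using the pipeline specification, we can instruct torch.distributed.pipelining where to split the model. In this tutorial, the split is placed before the transformer decoder layer at index 4, mirroring the manual split described above. Once the split is done, we can call build_stage to retrieve the PipelineStage.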
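The main script refers to a tracer_model_split function that is not shown in this section. The following is a sketch of what it could look like, reconstructed from the description above. It assumes that pipeline() accepts module, mb_args, and split_spec, that the submodule name "layers.4" identifies the decoder layer at index 4, and that stage_index and device are the globals set by init_distributed(). Check these assumptions against the installed PyTorch version.

```python
def tracer_model_split(model, example_input_microbatch):
    # Imported inside the function so this snippet stands alone; in the
    # tutorial script these names are already imported at module level.
    from torch.distributed.pipelining import SplitPoint, pipeline

    # Trace the model with one example microbatch and cut it at the
    # beginning of the submodule named "layers.4" (assumed name).
    pipe = pipeline(
        module=model,
        mb_args=(example_input_microbatch,),
        split_spec={
            "layers.4": SplitPoint.BEGINNING,
        },
    )
    # stage_index and device are assumed to be the globals set by
    # init_distributed() earlier in the script.
    stage = pipe.build_stage(stage_index, device)
    return stage
```

Unlike the manual mode, this approach leaves the model definition untouched and lets the tracer decide which submodules belong to each stage.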

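Step 2: Define the Main Execution

In the main function we create a particular pipeline schedule, which determines the order the stages follow. torch.distributed.pipelining supports multiple schedules, including the single-stage-per-rank schedules GPipe and 1F1B, as well as multiple-stage-per-rank schedules such as Interleaved1F1B and LoopedBFS.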

if __name__ == "__main__":
   init_distributed()
   num_microbatches = 4
   model_args = ModelArgs()
   model = Transformer(model_args)

   # Dummy data
   x = torch.ones(32, 500, dtype=torch.long)
   y = torch.randint(0, model_args.vocab_size, (32, 500), dtype=torch.long)
   example_input_microbatch = x.chunk(num_microbatches)[0]

   # Option 1: Manual model splitting
   stage = manual_model_split(model)

   # Option 2: Tracer model splitting
   # stage = tracer_model_split(model, example_input_microbatch)

   model.to(device)
   x = x.to(device)
   y = y.to(device)

   def tokenwise_loss_fn(outputs, targets):
      loss_fn = nn.CrossEntropyLoss()
      outputs = outputs.reshape(-1, model_args.vocab_size)
      targets = targets.reshape(-1)
      return loss_fn(outputs, targets)

   schedule = ScheduleGPipe(stage, n_microbatches=num_microbatches, loss_fn=tokenwise_loss_fn)

   if rank == 0:
      schedule.step(x)
   elif rank == 1:
      losses = []
      output = schedule.step(target=y, losses=losses)
      print(f"losses: {losses}")
   dist.destroy_process_group()

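In the example above, we use the manual method to split the model, but you can uncomment the corresponding line to try the tracer-based model splitting function instead. In our schedule, we need to pass in the number of microbatches and the loss function used to evaluate the targets.

The .step() function processes the entire minibatch and automatically splits it into microbatches based on the n_microbatches passed previously. The microbatches are then operated on according to the schedule class. In the example above, we use GPipe, which follows a simple all-forwards-then-all-backwards schedule. The output returned from rank 1 will be the same as if the model were on a single GPU and run with the entire batch. Similarly, we can pass in a losses container to store the corresponding loss for each microbatch.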
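The splitting and ordering described above can be sketched in plain Python. This is only a conceptual model, not the library's implementation: microbatch_sizes and gpipe_order are hypothetical helpers, the batch is assumed to divide evenly, and the backward passes are assumed to run in the same microbatch order as the forwards.

```python
from typing import List, Tuple


def microbatch_sizes(batch_size: int, n_microbatches: int) -> List[int]:
    """Sizes of the microbatches when the batch splits evenly.

    Assumes batch_size is divisible by n_microbatches (32 and 4 in the tutorial).
    """
    if batch_size % n_microbatches != 0:
        raise ValueError("this sketch assumes an even split")
    return [batch_size // n_microbatches] * n_microbatches


def gpipe_order(n_microbatches: int) -> List[Tuple[str, int]]:
    """Per-stage action order for an all-forwards-then-all-backwards schedule.

    "F" is a forward pass and "B" a backward pass on the given microbatch index.
    """
    forwards = [("F", mb) for mb in range(n_microbatches)]
    backwards = [("B", mb) for mb in range(n_microbatches)]
    return forwards + backwards


if __name__ == "__main__":
    print(microbatch_sizes(32, 4))
    print(gpipe_order(4))
```

With the tutorial's batch of 32 sequences and num_microbatches = 4, each microbatch holds 8 sequences, and every stage runs all 4 forwards before starting any backward.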

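Step 3: Launch the Distributed Processes

Finally, we are ready to run the script. We will use torchrun to create a single-host, 2-process job. Our script is already written so that rank 0 performs the logic required for pipeline stage 0, and rank 1 performs the logic for pipeline stage 1.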

torchrun --nnodes 1 --nproc_per_node 2 pipelining_tutorial.py

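Conclusion

In this tutorial, we learned how to implement distributed pipeline parallelism using PyTorch's torch.distributed.pipelining API. We covered setting up the environment, defining a transformer model, and partitioning it for distributed training. We discussed two methods of model partitioning, manual and tracer-based, and demonstrated how to schedule computation on microbatches across different stages. Finally, we covered executing the pipeline schedule and launching the distributed processes with torchrun.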

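Additional Resources

We have successfully integrated torch.distributed.pipelining into the torchtitan repository. TorchTitan is a clean, minimal codebase for large-scale LLM training using native PyTorch. For production-ready usage of pipeline parallelism, as well as its composition with other distributed techniques, see TorchTitan's end-to-end example of 3D parallelism.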