(prototype) PyTorch 2 Export Post Training Quantization

Created On: Apr 01, 2025 | Last Updated: Apr 01, 2025 | Last Verified: Nov 05, 2024

Author: 张杰

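This tutorial introduces the steps for post-training static quantization in graph mode based on torch.export.export. Compared to FX Graph Mode Quantization, this flow is expected to have significantly higher model coverage (88% on 14K models), better programmability, and a simplified UX.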

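Being exportable by torch.export.export is a prerequisite for using this flow. You can find the supported constructs in Export DB.

The high-level architecture of PyTorch 2 export quantization with a quantizer could look like this: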

float_model(Python)                          Example Input
    \                                              /
     \                                            /
---------------------------------------------------------
|                        export                         |
---------------------------------------------------------
                            |
                    FX Graph in ATen     Backend Specific Quantizer
                            |                       /
---------------------------------------------------------
|                     prepare_pt2e                      |
---------------------------------------------------------
                            |
                     Calibrate/Train
                            |
---------------------------------------------------------
|                     convert_pt2e                      |
---------------------------------------------------------
                            |
                    Quantized Model
                            |
---------------------------------------------------------
|                       Lowering                        |
---------------------------------------------------------
                            |
        ExecuTorch, Inductor or <Other Backends>

The PyTorch 2 export quantization API looks like this:

import torch
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.linear(x)


example_inputs = (torch.randn(1, 5),)
m = M().eval()

# Step 1. program capture
# This is available for pytorch 2.5+, for more details on lower pytorch versions
# please check `Export the model with torch.export` section
m = torch.export.export_for_training(m, example_inputs).module()
# we get a model with aten ops


# Step 2. quantization
from torch.ao.quantization.quantize_pt2e import (
  prepare_pt2e,
  convert_pt2e,
)

from torch.ao.quantization.quantizer.xnnpack_quantizer import (
  XNNPACKQuantizer,
  get_symmetric_quantization_config,
)
# backend developer will write their own Quantizer and expose methods to allow
# users to express how they
# want the model to be quantized
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_pt2e(m, quantizer)

# calibration omitted

m = convert_pt2e(m)
# we have a model with aten ops doing integer computations when possible

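Motivation of PyTorch 2 Export Quantization

Prior to PyTorch 2, we had FX Graph Mode Quantization, which uses QConfigMapping and BackendConfig for customization. QConfigMapping lets modeling users specify how they want their model to be quantized, and BackendConfig lets backend developers specify the ways of quantization their backend supports. While that API covers most use cases relatively well, it is not fully extensible. The current API has two main limitations:

1. Expressing quantization intentions for complicated operator patterns (how an operator pattern should be observed/quantized) is limited by the existing objects: QConfig and QConfigMapping.
2. There is limited support for how users can express how they want their model to be quantized. For example, if users want to quantize every other linear layer in the model, or if the quantization behavior depends on the actual shape of the tensor (for example, only observe/quantize inputs and outputs when the linear layer has a 3D input), backend developers or modeling users need to change the core quantization API/flow.

A few improvements could make the existing flow better: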

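3. We use QConfigMapping and BackendConfig as separate objects. QConfigMapping describes the user's intention of how they want their model to be quantized, and BackendConfig describes what kind of quantization a backend supports. BackendConfig is backend-specific, but QConfigMapping is not, so a user can provide a QConfigMapping that is incompatible with a specific BackendConfig, which is not a great UX. Ideally, we can structure this better by making both the configuration (QConfigMapping) and the quantization capability (BackendConfig) backend-specific, so there is less confusion about incompatibilities.
4. In QConfig, we expose observer/fake_quant observer classes as objects for the user to configure quantization. This increases the number of things a user needs to care about: not only the dtype but also how the observation should happen. These details could potentially be hidden from the user to make the user flow simpler.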

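Here is a summary of the benefits of the new API:

Programmability (addressing 1. and 2.): When a user's quantization needs are not covered by the available quantizers, users can build their own quantizer and compose it with other quantizers as mentioned above.
Simplified UX (addressing 3.): Provides a single instance with which both the backend and users interact. Thus, you no longer need a user-facing quantization config mapping to express user intent plus a separate quantization config that backends use to declare what they support. We will still provide a way for users to query what a quantizer supports. With a single instance, composing different quantization capabilities also becomes more natural than before.

For example, XNNPACK does not support embedding_byte, while ExecuTorch supports it natively. Thus, if we had an ExecuTorchQuantizer that only quantized embedding_byte, it could be composed with XNNPACKQuantizer. (Previously, this meant concatenating the two BackendConfig objects, and since the options in QConfigMapping are not backend-specific, users also had to figure out on their own how to specify configurations that match the quantization capabilities of the combined backend. With a single quantizer instance, we can compose two quantizers and query the composed quantizer for its capabilities, for example with composed_quantizer.quantization_capabilities(), which makes it less error-prone and cleaner.)
Separation of concerns (addressing 4.): As we design the quantizer API, we also decouple the specification of quantization, expressed in terms of dtype, min/max (number of bits), symmetric, and so on, from the observer concept. Currently, the observer captures both the quantization specification and how to observe (histogram vs. MinMax observer). With this change, modeling users are freed from interacting with observer and fake quant objects.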

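Define Helper Functions and Prepare the Dataset

We'll start by doing the necessary imports, defining some helper functions, and preparing the data. These steps are identical to Static Quantization with Eager Mode in PyTorch.

To run the code in this tutorial using the entire ImageNet dataset, first download ImageNet by following the instructions for the ImageNet dataset. Unzip the downloaded file into the data_path folder.

Download the torchvision resnet18 model and rename it to data/resnet18_pretrained_float.pth.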

import os
import sys
import time
import numpy as np

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

import torchvision
from torchvision import datasets
from torchvision.models.resnet import resnet18
import torchvision.transforms as transforms

# Set up warnings
import warnings
warnings.filterwarnings(
    action='ignore',
    category=DeprecationWarning,
    module=r'.*'
)
warnings.filterwarnings(
    action='default',
    module=r'torch.ao.quantization'
)

# Specify random seed for repeatable results
_ = torch.manual_seed(191009)


class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


def accuracy(output, target, topk=(1,)):
    """
    Computes the accuracy over the k top predictions for the specified
    values of k.
    """
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


def evaluate(model, criterion, data_loader):
    model.eval()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    cnt = 0
    with torch.no_grad():
        for image, target in data_loader:
            output = model(image)
            loss = criterion(output, target)
            cnt += 1
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            top1.update(acc1[0], image.size(0))
            top5.update(acc5[0], image.size(0))
    print('')

    return top1, top5

def load_model(model_file):
    model = resnet18(weights=None)
    state_dict = torch.load(model_file, weights_only=True)
    model.load_state_dict(state_dict)
    model.to("cpu")
    return model

def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print("Size (MB):", os.path.getsize("temp.p")/1e6)
    os.remove("temp.p")

def prepare_data_loaders(data_path):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    dataset = torchvision.datasets.ImageNet(
        data_path, split="train", transform=transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]))
    dataset_test = torchvision.datasets.ImageNet(
        data_path, split="val", transform=transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]))

    train_sampler = torch.utils.data.RandomSampler(dataset)
    test_sampler = torch.utils.data.SequentialSampler(dataset_test)

    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=train_batch_size,
        sampler=train_sampler)

    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=eval_batch_size,
        sampler=test_sampler)

    return data_loader, data_loader_test

data_path = '~/.data/imagenet'
saved_model_dir = 'data/'
float_model_file = 'resnet18_pretrained_float.pth'

train_batch_size = 30
eval_batch_size = 50

data_loader, data_loader_test = prepare_data_loaders(data_path)
example_inputs = (next(iter(data_loader))[0],)
criterion = nn.CrossEntropyLoss()
float_model = load_model(saved_model_dir + float_model_file).to("cpu")
float_model.eval()

# create another instance of the model since
# we need to keep the original model around
model_to_quantize = load_model(saved_model_dir + float_model_file).to("cpu")
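
As a quick sanity check of the top-k metric computed by the accuracy helper above, the sketch below reimplements the same idea in plain Python on a three-sample toy batch. The function name topk_accuracy and the toy scores are illustrative assumptions, not part of the tutorial's code.

```python
def topk_accuracy(scores, targets, topk=(1,)):
    """Return the top-k accuracy (in percent) for each k in topk."""
    results = []
    for k in topk:
        correct = 0
        for row, target in zip(scores, targets):
            # indices of the k highest-scoring classes for this sample
            ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
            if target in ranked:
                correct += 1
        results.append(100.0 * correct / len(targets))
    return results


# toy batch: one row of class scores per sample (hypothetical values)
scores = [
    [0.1, 0.7, 0.2],  # best guess is class 1
    [0.5, 0.3, 0.2],  # best guess is class 0, class 1 is second
    [0.2, 0.3, 0.5],  # best guess is class 2, class 0 is last
]
targets = [1, 1, 0]

top1, top2 = topk_accuracy(scores, targets, topk=(1, 2))
print(f"Acc@1 {top1:.2f}  Acc@2 {top2:.2f}")
```

Only the first sample is correct at k=1, and the second becomes correct at k=2, which is the behavior the Acc@1 and Acc@5 meters in evaluate rely on.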

Set the Model to Eval Mode

For post-training quantization, we need to set the model to eval mode.

model_to_quantize.eval()

Export the Model with torch.export

Here is how you can use torch.export to export the model:

example_inputs = (torch.rand(2, 3, 224, 224),)
# for pytorch 2.5+
exported_model = torch.export.export_for_training(model_to_quantize, example_inputs).module()

# for pytorch 2.4 and before
# from torch._export import capture_pre_autograd_graph
# exported_model = capture_pre_autograd_graph(model_to_quantize, example_inputs)

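# or capture with a dynamic batch dimension, so the exported model also accepts
# batch sizes other than the one in example_inputs (for example, during calibration)
# for pytorch 2.5+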
dynamic_shapes = tuple(
  {0: torch.export.Dim("dim")} if i == 0 else None
  for i in range(len(example_inputs))
)
exported_model = torch.export.export_for_training(model_to_quantize, example_inputs, dynamic_shapes=dynamic_shapes).module()

# for pytorch 2.4 and before
# dynamic_shape API may vary as well
# from torch._export import dynamic_dim
# exported_model = capture_pre_autograd_graph(model_to_quantize, example_inputs, constraints=[dynamic_dim(example_inputs[0], 0)])

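Import the Backend-Specific Quantizer and Configure How to Quantize the Model

The following code snippet describes how to quantize the model: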

from torch.ao.quantization.quantizer.xnnpack_quantizer import (
  XNNPACKQuantizer,
  get_symmetric_quantization_config,
)
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())

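Quantizer is backend-specific, and each Quantizer provides its own way for users to configure their model. As an example, here are the different configuration APIs supported by XNNPACKQuantizer: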

(
    quantizer.set_global(qconfig_opt)  # qconfig_opt is an optional quantization config
    .set_object_type(torch.nn.Conv2d, qconfig_opt)  # can be a module type
    .set_object_type(torch.nn.functional.linear, qconfig_opt)  # or a torch functional op
    .set_module_name("foo.bar", qconfig_opt)
)

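Note

Check out our tutorial that describes how to write a new Quantizer.

Prepare the Model for Post Training Quantization

prepare_pt2e folds BatchNorm operators into the preceding Conv2d operators and inserts observers at the appropriate places in the model.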

from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

prepared_model = prepare_pt2e(exported_model, quantizer)
print(prepared_model.graph)
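
To illustrate what folding BatchNorm into the preceding convolution means, here is a minimal scalar sketch, assuming the standard eval-mode batch norm formula. The helper names conv_then_bn and fold_bn are hypothetical, and a single multiply-add stands in for the convolution; this is not how prepare_pt2e is implemented.

```python
import math


def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """A scalar multiply-add (stand-in for a conv) followed by eval-mode batch norm."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch norm statistics into the conv weight and bias."""
    factor = gamma / math.sqrt(var + eps)
    return w * factor, (b - mean) * factor + beta


# made-up parameters for illustration
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0
w_folded, b_folded = fold_bn(w, b, gamma, beta, mean, var)

for x in (-2.0, 0.0, 3.5):
    reference = conv_then_bn(x, w, b, gamma, beta, mean, var)
    folded = w_folded * x + b_folded
    # the folded multiply-add reproduces conv + batch norm up to rounding error
    assert abs(reference - folded) < 1e-9
```

After folding, only one operator remains to be observed and quantized, which is why this step happens before observers are inserted.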

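Calibration

The calibration function is run after the observers are inserted in the model. The purpose of calibration is to run some sample examples that are representative of the workload (for example, a sample of the training dataset) so that the observers in the model can observe the statistics of the tensors. We can later use this information to calculate quantization parameters.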

def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)
calibrate(prepared_model, data_loader_test)  # run calibration on sample data
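
One plausible way the statistics gathered during calibration become quantization parameters is sketched below in plain Python, assuming a simple min/max observer and an int8 range. MinMaxTracker, affine_qparams, and symmetric_qparams are hypothetical names used for illustration; the observers used by a real quantizer (for example, histogram-based ones) may choose the range differently.

```python
def affine_qparams(min_val, max_val, quant_min=-128, quant_max=127):
    """Scale and zero point that map [min_val, max_val] onto [quant_min, quant_max]."""
    # the range must include 0.0 so that zero is represented exactly
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (quant_max - quant_min)
    scale = max(scale, 1e-8)  # guard against an all-zero tensor
    zero_point = quant_min - round(min_val / scale)
    zero_point = max(quant_min, min(quant_max, zero_point))
    return scale, zero_point


def symmetric_qparams(min_val, max_val, quant_min=-127, quant_max=127):
    """Symmetric variant: the range is centered on 0, so the zero point is 0.

    Assumption: a restricted integer range of [-127, 127].
    """
    max_abs = max(abs(min_val), abs(max_val))
    scale = max(max_abs / ((quant_max - quant_min) / 2), 1e-8)
    return scale, 0


class MinMaxTracker:
    """Toy stand-in for an observer: remembers the running min and max."""

    def __init__(self):
        self.min_val, self.max_val = float("inf"), float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))


tracker = MinMaxTracker()
for batch in ([0.5, 1.25, -0.75], [2.0, -1.0, 0.1]):  # two "calibration" batches
    tracker.observe(batch)

scale, zero_point = affine_qparams(tracker.min_val, tracker.max_val)
sym_scale, sym_zero_point = symmetric_qparams(tracker.min_val, tracker.max_val)
print(scale, zero_point, sym_scale, sym_zero_point)
```

The more representative the calibration batches are, the closer the tracked range is to what the model sees at inference time, and the less clipping or wasted resolution the resulting scale introduces.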

Convert the Calibrated Model to a Quantized Model

convert_pt2e takes a calibrated model and produces a quantized model.

quantized_model = convert_pt2e(prepared_model)
print(quantized_model)

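At this step, we currently have two representations that you can choose from, but the exact representation we offer in the long term might change based on feedback from PyTorch users.

Q/DQ Representation (default)

As in the previous documentation for representations, all quantized operators are represented as dequantize -> fp32_op -> quantize.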

# Illustrative pattern: dequantize the int8 input and weight, run the fp32 linear
# (addmm), then quantize the fp32 output back to int8
def quantized_linear(x_i8, x_scale, x_zero_point, x_quant_min, x_quant_max,
                     weight_i8, weight_scale, weight_zero_point, weight_quant_min, weight_quant_max,
                     bias_fp32, out_scale, out_zero_point, out_quant_min, out_quant_max):
    x_fp32 = torch.ops.quantized_decomposed.dequantize_per_tensor(
        x_i8, x_scale, x_zero_point, x_quant_min, x_quant_max, torch.int8)
    weight_fp32 = torch.ops.quantized_decomposed.dequantize_per_tensor(
        weight_i8, weight_scale, weight_zero_point, weight_quant_min, weight_quant_max, torch.int8)
    weight_permuted = torch.ops.aten.permute_copy.default(weight_fp32, [1, 0])
    out_fp32 = torch.ops.aten.addmm.default(bias_fp32, x_fp32, weight_permuted)
    out_i8 = torch.ops.quantized_decomposed.quantize_per_tensor(
        out_fp32, out_scale, out_zero_point, out_quant_min, out_quant_max, torch.int8)
    return out_i8
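
The numerical effect of a quantize -> dequantize pair can be sketched in plain Python. This assumes per-tensor affine quantization with an int8 range; the scalar helpers quantize and dequantize, and the chosen scale and zero point, are illustrative stand-ins rather than the quantized_decomposed operators themselves.

```python
def quantize(x, scale, zero_point, quant_min=-128, quant_max=127):
    """fp32 -> int8: divide by scale, round, shift by the zero point, clamp."""
    q = round(x / scale) + zero_point
    return max(quant_min, min(quant_max, q))


def dequantize(q, scale, zero_point):
    """int8 -> fp32: undo the shift and rescale."""
    return (q - zero_point) * scale


scale, zero_point = 0.05, 10  # made-up quantization parameters
for x in (0.0, 0.12, -1.0, 100.0):
    q = quantize(x, scale, zero_point)
    x_hat = dequantize(q, scale, zero_point)
    print(f"{x:8.3f} -> q={q:4d} -> {x_hat:8.3f}")
```

For values inside the representable range, the round-trip error is at most half of scale; values outside the range (such as 100.0 here) are clamped to the largest representable value.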

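Reference Quantized Model Representation

We will have a special representation for selected operators, for example, quantized linear. Other operators are represented as dq -> float32_op -> q, and q/dq are decomposed into more primitive operators. You can get this representation by using convert_pt2e(..., use_reference_representation=True).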

# Reference Quantized Pattern for quantized linear
# out_dtype runs the given ATen op and returns the result in the requested dtype
from torch._higher_order_ops.out_dtype import out_dtype

def quantized_linear(x_int8, x_scale, x_zero_point, weight_int8, weight_scale, weight_zero_point,
                     bias_fp32, output_scale, output_zero_point, output_quant_min, output_quant_max):
    x_int16 = x_int8.to(torch.int16)
    weight_int16 = weight_int8.to(torch.int16)
    # integer matmul on zero-point-corrected inputs, accumulated in int32
    acc_int32 = out_dtype(torch.ops.aten.mm.default, torch.int32, (x_int16 - x_zero_point), (weight_int16 - weight_zero_point))
    bias_scale = x_scale * weight_scale
    bias_int32 = out_dtype(torch.ops.aten.div.Tensor, torch.int32, bias_fp32, bias_scale)
    acc_int32 = acc_int32 + bias_int32
    # rescale the accumulator to the output scale, then shift by the output zero point
    acc_int32 = out_dtype(torch.ops.aten.mul.Scalar, torch.int32, acc_int32, x_scale * weight_scale / output_scale) + output_zero_point
    out_int8 = torch.ops.aten.clamp(acc_int32, output_quant_min, output_quant_max).to(torch.int8)
    return out_int8

Please see the PyTorch source code for the most up-to-date reference representations.
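
The integer arithmetic in the reference pattern can be traced by hand with a small plain-Python sketch. It assumes a single output feature (one dot product), made-up scales and zero points, and round-to-nearest where the pattern above relies on dtype conversion, so it mirrors the structure rather than the exact numerics. The helper names quantize and int_linear are hypothetical.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """fp32 -> int8 with rounding and clamping."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def int_linear(x_q, x_scale, x_zp, w_q, w_scale, w_zp, bias, out_scale, out_zp,
               qmin=-128, qmax=127):
    """Integer dot product followed by requantization to int8."""
    # integer accumulation of (x - x_zp) * (w - w_zp)
    acc = sum((xq - x_zp) * (wq - w_zp) for xq, wq in zip(x_q, w_q))
    # the bias is expressed in the accumulator's scale, x_scale * w_scale
    acc += round(bias / (x_scale * w_scale))
    # rescale the accumulator to the output scale, then shift and clamp
    out = round(acc * (x_scale * w_scale / out_scale)) + out_zp
    return max(qmin, min(qmax, out))


# made-up inputs and quantization parameters
x, w, bias = [0.5, -1.0, 2.0], [0.25, 0.5, -0.75], 0.1
x_scale, x_zp = 0.02, 5
w_scale, w_zp = 0.01, 0
out_scale, out_zp = 0.02, -3

x_q = [quantize(v, x_scale, x_zp) for v in x]
w_q = [quantize(v, w_scale, w_zp) for v in w]
out_q = int_linear(x_q, x_scale, x_zp, w_q, w_scale, w_zp, bias, out_scale, out_zp)

float_result = sum(a * b for a, b in zip(x, w)) + bias
dequantized = (out_q - out_zp) * out_scale
print(out_q, float_result, dequantized)
```

The dequantized integer result stays within one output quantization step of the float result, which is the property a lowered integer kernel is expected to preserve.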

Checking Model Size and Accuracy Evaluation

Now we can compare the size and accuracy of the quantized model with the baseline model.

# Baseline model size and accuracy
print("Size of baseline model")
print_size_of_model(float_model)

top1, top5 = evaluate(float_model, criterion, data_loader_test)
print("Baseline Float Model Evaluation accuracy: %2.2f, %2.2f"%(top1.avg, top5.avg))

# Quantized model size and accuracy
print("Size of model after quantization")
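# export again to remove unused weights; keep the batch dimension dynamic so the
# re-exported model still accepts the evaluation batch size
quantized_model = torch.export.export_for_training(quantized_model, example_inputs, dynamic_shapes=dynamic_shapes).module()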
print_size_of_model(quantized_model)

top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
print("[before serialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))

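Note

We can't do performance evaluation now since the model is not lowered to a target device; it is just a representation of quantized computation in ATen operators.

Note

The weights are still in fp32 right now; we may do constant propagation for the quantize op to get integer weights in the future.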
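
To put the note above in perspective, here is a back-of-the-envelope estimate of parameter storage at 4 bytes (fp32) versus 1 byte (int8) per weight. The parameter count is an assumption (roughly the figure commonly quoted for ResNet-18), and the hypothetical helper estimated_size_mb ignores scales, zero points, and serialization overhead.

```python
def estimated_size_mb(num_params, bytes_per_param):
    """Storage needed for the parameters alone, in MB (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


# Assumption: about 11.7 million parameters, roughly the size of ResNet-18.
num_params = 11_700_000
fp32_mb = estimated_size_mb(num_params, 4)
int8_mb = estimated_size_mb(num_params, 1)
print(f"fp32: {fp32_mb:.1f} MB, int8: {int8_mb:.1f} MB, ratio: {fp32_mb / int8_mb:.1f}x")
```

Until the weights are actually stored as integers, the size printed by print_size_of_model may stay close to the fp32 estimate.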

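If you want to get better accuracy or performance, try configuring the quantizer in different ways. Each quantizer has its own way of configuration, so please consult the documentation for the quantizer you are using to learn more about how you can have more control over how to quantize a model.

Save and Load Quantized Model

We'll show how to save and load the quantized model.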

# 0. Store the reference output for the example inputs and check evaluation accuracy:
example_inputs = (next(iter(data_loader))[0],)
ref = quantized_model(*example_inputs)
top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
print("[before serialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))

# 1. Export the model and Save ExportedProgram
pt2e_quantized_model_file_path = saved_model_dir + "resnet18_pt2e_quantized.pth"
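# capture the model to get an ExportedProgram; the batch dimension is kept dynamic
# so the loaded model can be evaluated with a different batch size
quantized_ep = torch.export.export(quantized_model, example_inputs, dynamic_shapes=dynamic_shapes)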
# use torch.export.save to save an ExportedProgram
torch.export.save(quantized_ep, pt2e_quantized_model_file_path)


# 2. Load the saved ExportedProgram
loaded_quantized_ep = torch.export.load(pt2e_quantized_model_file_path)
loaded_quantized_model = loaded_quantized_ep.module()

# 3. Check results for example inputs and check evaluation accuracy again:
res = loaded_quantized_model(*example_inputs)
print("diff:", ref - res)

top1, top5 = evaluate(loaded_quantized_model, criterion, data_loader_test)
print("[after serialization/deserialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))

Output:

[before serialization] Evaluation accuracy on test dataset: 79.82, 94.55
diff: tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

[after serialization/deserialization] Evaluation accuracy on test dataset: 79.82, 94.55

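Debugging the Quantized Model

You can use Numeric Suite, which helps with debugging in eager mode and FX graph mode. A new version of Numeric Suite that works with PyTorch 2 Export models is still in development.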

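Lowering and Performance Evaluation

The model produced at this point is not the final model that runs on the device. It is a reference quantized model that captures the quantized computation intended by the user, expressed as ATen operators and some additional quantize/dequantize operators. To get a model that runs on real devices, we need to lower the model. For example, for models that run on edge devices, we can lower with delegation and ExecuTorch runtime operators.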

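Conclusion

In this tutorial, we went through the overall quantization flow in PyTorch 2 Export Quantization using XNNPACKQuantizer and got a quantized model that could be further lowered to a backend that supports inference with the XNNPACK backend. To use this for your own backend, please first follow the tutorial and implement a Quantizer for your backend, and then quantize the model with that Quantizer.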