Optimizing Vision Transformer Model for Deployment

Created on: April 1, 2025 | Last updated: April 1, 2025 | Last verified: November 5, 2024

Jeff Tang, Geeta Chauhan

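Vision Transformer models apply the attention-based Transformer architecture, first introduced in natural language processing, to computer vision tasks, where they achieve a range of state-of-the-art (SOTA) results. Facebook's Data-efficient Image Transformer (DeiT) is a Vision Transformer model trained on ImageNet for image classification.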

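In this tutorial, we first cover what DeiT is and how to use it, then walk through the complete steps of scripting, quantizing, and optimizing the model and using it in iOS and Android apps. We also compare the performance of the quantized, optimized model against the non-quantized, non-optimized model, and show the benefit of applying quantization and optimization at each step.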

What is DeiT?

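Convolutional Neural Networks (CNNs) have been the main models for image classification since deep learning took off in 2012, but CNNs typically require hundreds of millions of training images to reach SOTA results. DeiT is a Vision Transformer model that needs far less data and compute for training while still competing with the leading CNNs on image classification. Two key components of DeiT make this possible: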

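Data augmentation that simulates training on a much larger dataset;
Native distillation that allows the Transformer network to learn from a CNN's output.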
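To make the distillation idea more concrete, here is a minimal pure-Python sketch of a hard-label distillation loss in the spirit of the one described in the DeiT paper. It is a simplified illustration, not the repository's training code: the function names `cross_entropy` and `hard_distillation_loss` and the plain-list logits are assumptions made for this example. The student has two output heads; one is trained against the true label and the other against the label predicted by the CNN teacher.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def hard_distillation_loss(class_logits, dist_logits, teacher_logits, label):
    """Average of two losses (hypothetical helper for illustration).

    class_logits:   student output trained against the true label
    dist_logits:    student output trained against the teacher's prediction
    teacher_logits: output of the (frozen) CNN teacher
    """
    # The teacher's "hard" label is simply its most likely class.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_true = cross_entropy(class_logits, label)
    loss_teacher = cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_true + 0.5 * loss_teacher


loss = hard_distillation_loss(
    class_logits=[2.0, 0.5, -1.0],
    dist_logits=[1.5, 0.2, -0.3],
    teacher_logits=[3.0, 1.0, 0.0],
    label=0,
)
print(loss)
```

Note that the `deit_base_patch16_224` checkpoint used later in this tutorial is loaded only for inference, so none of this training logic is needed to follow the remaining steps.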

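DeiT shows that Transformers can be applied successfully to computer vision tasks even with limited access to data and resources. For more details on DeiT, see the repo and paper.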

Classifying Images with DeiT

Follow the README.md in the DeiT repository for detailed instructions on classifying images with DeiT. For a quick test, first install the required packages:

pip install torch torchvision timm pandas requests

To run in Google Colab, install the dependencies with the following command:

!pip install timm pandas requests

Then run the script below:

from PIL import Image
import torch
import timm
import requests
import torchvision.transforms as transforms
from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD

print(torch.__version__)
# should be 1.8.0


model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)
model.eval()

transform = transforms.Compose([
    transforms.Resize(256, interpolation=3),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD),
])

img = Image.open(requests.get("https://raw.githubusercontent.com/pytorch/ios-demo-app/master/HelloWorld/HelloWorld/HelloWorld/image.png", stream=True).raw)
img = transform(img)[None,]
out = model(img)
clsidx = torch.argmax(out)
print(clsidx.item())

The output should be 269, which, according to the ImageNet class-index-to-label file, maps to timber wolf, grey wolf, gray wolf, Canis lupus.
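`torch.argmax` reports only the single most likely class. As a minimal sketch of how one might inspect the runner-up classes and their probabilities, the pure-Python helper below applies a softmax to a list of logits and returns the top `k` entries. The name `top_k` is hypothetical, and the sketch assumes the logits have already been converted to a plain list (for the tensor above, something like `out[0].detach().tolist()` would be one way to do that).

```python
import math


def top_k(logits, k=5):
    """Return the k most likely (class_index, probability) pairs."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by descending probability.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Toy logits for a 4-class problem (not real DeiT output).
for index, prob in top_k([0.1, 2.0, 0.5, -1.0], k=2):
    print(index, round(prob, 3))
```

With real DeiT output, the returned indices can be looked up in the same ImageNet class-index-to-label file mentioned above.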

Now that we have verified that the DeiT model can classify images, let's see how to modify the model so that it can run in iOS and Android apps.

Scripting DeiT

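To use the model on mobile, we first need to script it. See the Script and Optimize recipe for a quick overview. Run the code below to convert the DeiT model used in the previous step to the TorchScript format, which can run on mobile.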

model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)
model.eval()
scripted_model = torch.jit.script(model)
scripted_model.save("fbdeit_scripted.pt")

This generates the scripted model file fbdeit_scripted.pt, which is about 346MB in size.

Quantizing DeiT

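To reduce the size of the trained model significantly while keeping inference accuracy about the same, we can quantize the model. Because DeiT is built on the Transformer architecture, we can easily apply dynamic quantization, which works best for LSTM and Transformer models (see here for more details).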
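As a rough intuition for why quantization shrinks the model, the sketch below maps 32-bit float weights to 8-bit integers plus a single scale factor. It is a deliberately simplified symmetric per-tensor scheme written in pure Python; the helper names `quantize_int8` and `dequantize` are hypothetical, and PyTorch's actual quantization kernels differ in detail.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)  # small integers that each fit in one byte
print(restored)   # close to the original weights, within scale / 2
```

Each weight now needs one byte instead of four, which is why a model dominated by linear-layer weights can shrink to roughly a quarter of its original size. In dynamic quantization, the weights are converted ahead of time in this spirit, while activations are quantized on the fly during inference.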

Now run the code below:

# Use 'x86' for server inference (the old 'fbgemm' is still available but 'x86' is the recommended default) and ``qnnpack`` for mobile inference.
backend = "x86" # replaced with ``qnnpack`` causing much worse inference speed for quantized model on this notebook
model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend

quantized_model = torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_quantized_model = torch.jit.script(quantized_model)
scripted_quantized_model.save("fbdeit_scripted_quantized.pt")

This generates the scripted and quantized version of the model, fbdeit_scripted_quantized.pt, which is about 89MB in size, a 74% reduction from the 346MB non-quantized model.
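The 74% figure follows directly from the two file sizes. A minimal sketch of the arithmetic, using only the standard library, is shown below; the helpers `size_mb` and `reduction_percent` are hypothetical names introduced for this example.

```python
import os


def size_mb(path):
    """File size in megabytes (1 MB taken as 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


def reduction_percent(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


# Using the approximate sizes reported in this tutorial:
print("{:.0f}%".format(reduction_percent(346, 89)))  # prints 74%
```

After running the tutorial code, `reduction_percent(size_mb("fbdeit_scripted.pt"), size_mb("fbdeit_scripted_quantized.pt"))` would give the reduction for the files actually written to disk, which may differ slightly across PyTorch versions.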

You can use scripted_quantized_model to produce the same inference result:

out = scripted_quantized_model(img)
clsidx = torch.argmax(out)
print(clsidx.item())
# The same output 269 should be printed

Optimizing DeiT

The final step before using the quantized and scripted model on mobile is to optimize it:

from torch.utils.mobile_optimizer import optimize_for_mobile
optimized_scripted_quantized_model = optimize_for_mobile(scripted_quantized_model)
optimized_scripted_quantized_model.save("fbdeit_optimized_scripted_quantized.pt")

The generated fbdeit_optimized_scripted_quantized.pt file is about the same size as the quantized, scripted, but non-optimized model. The inference result remains the same.

out = optimized_scripted_quantized_model(img)
clsidx = torch.argmax(out)
print(clsidx.item())
# Again, the same output 269 should be printed

Using Lite Interpreter

To see how much model size reduction and inference speedup the Lite Interpreter can deliver, let's create the lite version of the model.

optimized_scripted_quantized_model._save_for_lite_interpreter("fbdeit_optimized_scripted_quantized_lite.ptl")
ptl = torch.jit.load("fbdeit_optimized_scripted_quantized_lite.ptl")

Although the lite model is comparable in size to the non-lite version, an inference speedup is expected when the lite version runs on mobile.

Comparing Inference Speed

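To see how inference speed differs across the five models (the original model, the scripted model, the quantized-and-scripted model, the optimized-quantized-and-scripted model, and the lite model), run the code below: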

with torch.autograd.profiler.profile(use_cuda=False) as prof1:
    out = model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof2:
    out = scripted_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof3:
    out = scripted_quantized_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof4:
    out = optimized_scripted_quantized_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof5:
    out = ptl(img)

print("original model: {:.2f}ms".format(prof1.self_cpu_time_total/1000))
print("scripted model: {:.2f}ms".format(prof2.self_cpu_time_total/1000))
print("scripted & quantized model: {:.2f}ms".format(prof3.self_cpu_time_total/1000))
print("scripted & quantized & optimized model: {:.2f}ms".format(prof4.self_cpu_time_total/1000))
print("lite model: {:.2f}ms".format(prof5.self_cpu_time_total/1000))

The results from a run on Google Colab are:

original model: 1236.69ms
scripted model: 1226.72ms
scripted & quantized model: 593.19ms
scripted & quantized & optimized model: 598.01ms
lite model: 600.72ms

The following results summarize the inference time of each model and its percentage reduction relative to the original model.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Model': ['original model','scripted model', 'scripted & quantized model', 'scripted & quantized & optimized model', 'lite model']})
df = pd.concat([df, pd.DataFrame([
    ["{:.2f}ms".format(prof1.self_cpu_time_total/1000), "0%"],
    ["{:.2f}ms".format(prof2.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof2.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof3.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof3.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof4.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof4.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof5.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof5.self_cpu_time_total)/prof1.self_cpu_time_total*100)]],
    columns=['Inference Time', 'Reduction'])], axis=1)

print(df)

"""
        Model                             Inference Time    Reduction
0   original model                             1236.69ms           0%
1   scripted model                             1226.72ms        0.81%
2   scripted & quantized model                  593.19ms       52.03%
3   scripted & quantized & optimized model      598.01ms       51.64%
4   lite model                                  600.72ms       51.43%
"""
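The figures above come from a single profiled forward pass per model, so they can vary noticeably between runs. As a complementary check, here is a minimal sketch that averages wall-clock latency over several runs after a short warm-up. The helper `average_latency_ms` is a hypothetical name, and it measures elapsed time with `time.perf_counter` rather than the profiler's self CPU time, so its numbers are not directly comparable to the table.

```python
import statistics
import time


def average_latency_ms(fn, warmup=3, runs=10):
    """Return (mean, standard deviation) of fn's latency in milliseconds.

    `runs` must be at least 2 so that a standard deviation can be computed.
    """
    # Warm-up calls are not timed; they let caches and lazy setup settle.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.mean(timings), statistics.stdev(timings)


# Stand-in workload; in this tutorial one would pass e.g. `lambda: model(img)`.
mean_ms, stdev_ms = average_latency_ms(lambda: sum(range(100_000)))
print("{:.2f}ms +/- {:.2f}ms".format(mean_ms, stdev_ms))
```

Reporting the spread alongside the mean makes it easier to judge whether small differences, such as those between the quantized, optimized, and lite models above, are meaningful or within run-to-run noise.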

Learn More
