Improve RAG performance with torch.compile on AWS Graviton Processors

由 Sunita Nadampalli（AWS）、Ankith Gunapal（Meta）、Hamid Shojanazeri（Meta）撰写

大型语言模型（LLMs）在大量数据上训练，并使用数十亿参数来支持回答问题、翻译语言和完成句子等任务。在使用LLMs时存在一些挑战，例如领域知识差距、事实性问题以及幻觉，这些问题影响了它们的可靠性，尤其是在需要高度准确性的领域，如医疗保健、法律或工程。检索增强生成（RAG）通过在不重新训练模型的情况下，将特定领域或组织内部知识库增强到LLMs中，从而提供了一种缓解这些问题的解决方案。

RAG 知识源通常是特定于业务的数据库，通常部署在通用 CPU 基础设施上。因此，在通用 CPU 基础设施上与相关业务服务一起部署 RAG 既高效又经济。基于这种动机，我们评估了在 AWS Graviton 基于的 Amazon EC2 实例上部署 RAG，这些实例在包括数据库、内存缓存、大数据分析、媒体编解码器、游戏服务器和机器学习推理在内的多数工作负载中，与同类实例相比，提供了高达 40%的价格性能优势。

以前我们发布了几篇关于如何优化 PyTorch 以加速 AWS Graviton 处理器上的 ML 推理性能的博客文章，包括急切模式（博客）和 torch.compile 模式（博客）。在这篇博客中，我们将介绍如何使用 PyTorch 和 torch.compile 部署典型的 RAG 工作负载，我们如何将其性能提高了 1.7 倍（对于嵌入模型）和 1.3 倍（对于 RAG 查询）与默认的 PyTorch“急切模式”相比，以及一些你可以应用于你的 RAG 用例的建议。

如何优化 RAG？

没有 RAG，LLM接收用户输入并根据其训练信息（已知信息）生成响应。有了 RAG，引入了一个信息检索组件，该组件利用用户输入首先从新的数据源中检索信息。用户查询和相关信息都提供给LLM。LLM利用新知识和其训练数据生成更好的响应。以下图表显示了使用 RAG 与LLMs的概念流程。

Image 1: Conceptual flow of using RAG with LLMs

图像 1：使用 RAG 与LLMs的概念流程

来源：https://aws.amazon.com/what-is/retrieval-augmented-generation/

嵌入式模型

RAG 的核心是一个嵌入模型，它将文本数据转换为向量表示。这些向量随后存储在向量数据库中。当用户进行查询时，查询首先被转换为向量，然后 RAG 在向量数据库上执行相似度搜索。因此，优化 RAG 性能的第一步是优化嵌入模型的推理性能。我们使用了基于 AWS Graviton3 的 m7g.xlarge 实例和 HuggingFace sentence-transformer 嵌入模型进行优化工作。以下是使用 PyTorch Eager 模式分析 HuggingFace sentence-transformer 嵌入模型推理的示例脚本。

import torch
from torch.profiler import profile, ProfilerActivity, record_function
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-mpnet-base-v2"
input_text = ["This is an example sentence", "Each sentence is converted"]

model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

encoded_input = tokenizer(
    input_text, padding=True, truncation=True, return_tensors="pt"
)

warmup, actual = 100, 100
model.eval()

with torch.no_grad():
    # warmup
    for i in range(warmup):
        embeddings = model(**encoded_input)

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for i in range(actual):
                embeddings = model(**encoded_input)
        print(prof.key_averages().table(sort_by="self_cpu_time_total"))

热切模式

由于 PyTorch 热切模式已经在 AWS Graviton 处理器上进行了优化，并且具有以下运行环境设置，因此我们将其包含在基线中，并测量了以下性能。有关如何在 AWS Graviton 处理器上优化 PyTorch 热切模式的更多详细信息，请参阅《使用 AWS Graviton 处理器优化 PyTorch 2.0 推理》。

# Enable the fast math GEMM kernels, to accelerate fp32 inference with bfloat16 gemm
export DNNL_DEFAULT_FPMATH_MODE=BF16

# Enable Linux Transparent Huge Page (THP) allocations,
# to reduce the tensor memory allocation latency
export THP_MEM_ALLOC_ENABLE=1

# Set LRU Cache capacity to cache the primitives and avoid redundant
# memory allocations
export LRU_CACHE_CAPACITY=1024

---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                aten::addmm        61.01%        2.638s        62.49%        2.702s     370.197us          7300  
            model_inference        12.01%     519.161ms       100.00%        4.324s        4.324s             1  
                  aten::bmm         6.25%     270.084ms        11.96%     517.089ms     215.454us          2400  
               aten::select         3.98%     172.165ms         5.34%     230.863ms       1.331us        173500  
                aten::copy_         2.11%      91.133ms         2.11%      91.133ms       6.200us         14700   
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.324s

表 1：在基于 AWS Graviton3 的 m7g.xlarge 实例上使用 PyTorch 热切模式对 HuggingFace sentence-transformer 嵌入模型推理进行剖析的输出

接下来，我们增加了 torch.compile ，权重预打包，以及 torch.inference_mode ，并观察到大约 1.7 倍的性能提升。下一节将讨论这些优化以及带来的速度提升。

torch.compile

与急切模式相比， torch.compile 将整个模型预编译成一个单一的图，这种编译方式针对在特定硬件上运行进行了优化。请参阅《在 AWS Graviton 处理器上使用 torch.compile 加速 PyTorch 推理》以获取更多关于 torch.compile 功能以及我们如何在 AWS Graviton 处理器上优化它们的详细信息。按照以下代码片段调用 torch.compile 以触发模型的 PyTorch 动态编译。这使性能从基线提升了大约 1.04 倍。

model = torch.compile(model)

----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 aten::addmm        64.46%        2.675s        66.66%        2.766s     378.905us          7300  
       Torch-Compiled Region        19.76%     820.085ms        99.04%        4.109s      41.094ms           100  
                   aten::bmm         6.66%     276.216ms        12.52%     519.527ms     216.470us          2400  
                aten::select         3.98%     164.991ms         5.41%     224.488ms       1.299us        172800  
            aten::as_strided         1.66%      69.039ms         1.66%      69.039ms       0.383us        180100  
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.149s

表 2：在基于 AWS Graviton3 的 m7g.xlarge 实例上使用 torch.compile 模式对 HuggingFace sentence-transformer 嵌入模型进行推理的 Profiler 输出

权重预打包

在模型编译过程中将模型权重预打包成更适合给定硬件的格式，从而提高性能，打开机会。设置以下配置以触发权重预打包。这使性能提升了约 1.69 倍。

import torch._inductor.config as config
config.cpp.weight_prepack=True
config.freezing=True

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
    mkldnn::_linear_pointwise        39.10%     994.821ms        41.50%        1.056s     144.628us          7300  
        Torch-Compiled Region        35.12%     893.675ms        98.42%        2.504s      25.043ms           100  
                    aten::bmm        10.96%     278.859ms        21.66%     551.073ms     229.614us          2400  
                 aten::select         7.34%     186.838ms         9.98%     253.840ms       1.469us        172800  
             aten::as_strided         2.63%      67.002ms         2.63%      67.002ms       0.388us        172800   
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.544s

表 3：在 AWS Graviton3-based m7g.xlarge 实例上使用 torch.compile 和权重预打包进行 HuggingFace sentence-transformer 嵌入模型推理的 Profiler 输出

torch.inference_mode

此外，使用 torch.inference_mode() 来从关闭张量的版本控制和张量视图跟踪中节省资源。请参阅 PyTorch 文档以获取更多详细信息。

with torch.inference_mode():
# instead of
with torch.no_grad():

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
    mkldnn::_linear_pointwise        38.92%     987.276ms        41.17%        1.044s     143.056us          7300  
        Torch-Compiled Region        34.92%     885.895ms        98.45%        2.498s      24.975ms           100  
                    aten::bmm        11.25%     285.292ms        22.22%     563.594ms     234.831us          2400  
                 aten::select         7.74%     196.223ms        10.22%     259.251ms       1.500us        172800  
             aten::as_strided         2.48%      63.027ms         2.48%      63.027ms       0.365us        172800  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.537s

表 4：在 AWS Graviton3-based m7g.xlarge 实例上使用 torch.compile、权重预打包和 inference_mode 进行的 HuggingFace sentence-transformer 嵌入模型推理的 Profiler 输出

下表显示了独立嵌入模型推理所实现的增量性能提升。

优化等级	延迟测量（秒）	基准线上的改进
PyTorch 急切模式（基准）	0.04324	NA
torch.compile	0.04149	1.04 倍
权重预打包	0.02544	1.69 倍
torch.inference_mode	0.02537	1.70 倍

以下脚本是一个更新后的示例，用于嵌入模型推理，其中包含了之前讨论的优化。优化内容以绿色突出显示。


从 torch.profiler 导入 profile、record_function、ProfilerActivity，从 transformers 导入 AutoTokenizer、AutoModel，导入 torch._inductor.config as config，设置 config.cpp.weight_prepack=True，config.freezing=True，model_name = "sentence-transformers/all-mpnet-base-v2"，input_text = ['这是一个示例句子', '每个句子都被转换']，加载预训练模型 model，加载预训练分词器 tokenizer，对 input_text 进行编码得到 encoded_input，进行模型预热，实际运行 100 次，将模型设置为评估模式，使用 torch.compile 编译模型，进入推理模式，使用 with torch.no_grad()进行预热，预热 100 次，使用 with profile(activities=[ProfilerActivity.CPU]) as prof 进行性能分析，使用 record_function("model_inference")记录模型推理，实际运行 100 次模型推理，打印性能分析结果

CPU 上的端到端 RAG 场景

在优化嵌入模型推理后，我们开始使用基于 PyTorch eager 模式的 RAG 配置，主要是为了在 CPU 后端验证其功能。我们使用 HuggingFaceEmbeddings 从 langchain_community.embeddings 构建了 RAG 解决方案，如下代码片段所示。

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from langchain.prompts import PromptTemplate
from langchain_core.prompts import format_document
from bs4 import BeautifulSoup as Soup
import torch

url =  "https://maskerprc.github.io/blog/pytorch2-5/"
chunk_size = 1000
chunk_overlap = 0
embedding_model = "sentence-transformers/all-mpnet-base-v2"
N = 5

question = "What's new in PyTorch 2.5?"

from transformers import AutoTokenizer, AutoModel
from typing import Any, List

loader = RecursiveUrlLoader(
            url=url, max_depth=3, extractor=lambda x: Soup(x, "html.parser").text
        )       
docs = loader.load()

# Split the document into chunks with a specified chunk size
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
all_splits = text_splitter.split_documents(docs)

# Store the document into a vector store with a specific embedding model
model = HuggingFaceEmbeddings(model_name=embedding_model)

warmup , actual = 100, 100

with torch.inference_mode():
    vectorstore = FAISS.from_documents(all_splits, model)

    for i in range(warmup):
        searchDocs = vectorstore.similarity_search(question, k=N)

    import time

    start = time.time()
    for i in range(actual):
        searchDocs = vectorstore.similarity_search(question, k=N)
    end = time.time()
    print(f"Time for 1 inference is {(end-start)/actual} seconds")

    doc_prompt = PromptTemplate.from_template("{page_content}")
    context = ""
    for i, doc in enumerate(searchDocs):
        context += f"\n{format_document(doc, doc_prompt)}\n"

接下来，我们的目标是使用 torch.compile 和权重预打包优化端到端 RAG 用例，这为独立的嵌入模型推理带来了 1.7 倍的改进。然而，这些优化并没有直接适用于 RAG 场景。

在端到端 RAG 场景中实现类似增益的挑战和解决方案是什么？

挑战 1：模型处理

无法获取使用 HuggingFaceEmbeddings 实例化的模型句柄，包装类也不提供编译 API。因此，我们的应用程序无法调用 torch.compile 来触发 PyTorch 动态编译过程。

解决方案

我们实现了自定义嵌入类，以便我们可以获取模型的句柄。这从 sentence-transformers 实例化了嵌入模型，并保留了句柄以供即时编译或稍后阶段的编译。有了这个，我们能够触发 torch.compile ，从而实现动态编译。

class CustomEmbedding(HuggingFaceEmbeddings):
    
    def __init__(self, **kwargs: Any):
        """Initialize the sentence_transformer."""
        super().__init__(**kwargs)

        # Load model from HuggingFace Hub
        self.client = AutoModel.from_pretrained(self.model_name)
    class Config:
        arbitrary_types_allowed = True


    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Compute doc embeddings using a HuggingFace transformer model.
        Args:
            texts: The list of texts to embed.
        Returns:
            List of embeddings, one for each text.
        """

        texts = list(map(lambda x: x.replace("\n", " "), texts))

        # Tokenize sentences
        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        
        embeddings = self.client(
           **encoded_input, output_hidden_states=True
        )
        embeddings = embeddings.pooler_output.detach().numpy()

        return embeddings.tolist()

# instead of model = HuggingFaceEmbeddings(model_name=embedding_model)
model = CustomEmbedding(model_name=embedding_model)

# torch.compile the model
model.client = torch.compile(model.client)

挑战 2：触发优化

对于典型的推理场景，其中图被冻结且梯度计算被禁用，Torch inductor（我们用于 CPU 的编译后端）会调用硬件特定的优化，如将图重写为更高效的算子、算子融合和权重预打包。尽管 Torch dynamo 能够看到模型并触发通用编译，但它未能触发 Torch inductor 中的这些附加 Fx 传递。

火炬电感器未触发优化遍历的两个主要原因：（1）应用程序未设置 no_grad() 或 inference_mode() ，以便火炬电感器检测到图已冻结；（2）我们在 torch.compile 框架中遇到了限制，如果在编译区域的开始处设置了 no_grad ，那么在调用电感器 Fx 遍历时， torch.compile 将无法检测到它，因为它那时还没有到达 no_grad 区域。请参阅此 GitHub 问题以获取更多详细信息。

解决方案

我们通过将 no_grad() 上下文从模型类内部移动到应用程序代码中来解决这个问题。这样，模型编译按预期进行，并在对稳定推理遍历的急切和编译版本进行性能分析时，实现了大约 1.3 倍的性能提升。

挑战 3：额外的编译

通过之前的修复，查询查找推理性能得到了提升，但基准测试脚本的总体执行时间并没有改善。我们将其归因于 RAG 推理期间模型的冗余编译。进一步深入调查发现，这是由于词嵌入和 RAG 查询阶段之间的批处理大小不匹配造成的。例如，在我们的基准测试脚本中，当数据库被矢量化并存储在向量数据库中时，我们使用了 16 的批处理大小，因此模型被编译成 16xNxK 的形状。而 RAG 查询查找通常是一个形状为 1xNxK 的单个请求。因此，存在批处理大小不匹配（这些张量的“0”维度），触发了查询查找阶段的重新编译。我们通过以下 Torch 日志进行了确认： TORCH_LOGS="recompiles"

TORCH_LOGS="recompiles" python rag_compile.py 
V1103 02:48:08.805986 34281 site-packages/torch/_dynamo/guards.py:2813] [0/1] [__recompiles] Recompiling function forward in site-packages/transformers/models/mpnet/modeling_mpnet.py:502
V1103 02:48:08.805986 34281 site-packages/torch/_dynamo/guards.py:2813] [0/1] [__recompiles]     triggered by the following guard failure(s):
V1103 02:48:08.805986 34281 site-packages/torch/_dynamo/guards.py:2813] [0/1] [__recompiles]     - 0/0: tensor 'L['input_ids']' size mismatch at index 0. expected 16, actual 1

解决方案

Torch dynamo 提供了一个装饰器来标记给定张量的维度为动态，并指定相应的预期值，以避免触发重新编译。例如，指定 input_ids 和 attention_mask 的维度“0”为动态，并指定该维度允许的值为“1”（如下面的代码片段所示），应该可以避免冗余编译。

torch._dynamo.decorators.mark_unbacked(encoded_input['input_ids'], 0)
torch._dynamo.mark_dynamic(encoded_input['input_ids'], 1)
        torch._dynamo.decorators.mark_unbacked(encoded_input['attention_mask'], 0)
torch._dynamo.mark_dynamic(encoded_input['attention_mask'], 1)

然而，在这个特定情况下，Torch 动态装饰器和标记并没有起作用。此外，使用装饰器创建的图会中断。因此，我们添加了一些预热迭代来隐藏编译延迟，并在稳态下分析了查询查找性能。然而，好消息是，在实践中，这种重新编译只会在第一次查询时触发，所以如果数据库大小固定，可能不会影响生产场景。此外，PyTorch AOT Inductor（PyTorch 的新功能）通过 torch.compile 解决了重新编译和预热挑战。在后续的博客中，我们将讨论如何在生产环境中使用 AOT Inductor 来解决这些挑战。

通过这些解决方案，我们能够应用 torch.compile、权重预打包以及针对 AWS Graviton 的特定优化，在端到端 RAG 场景中提高性能，与基线 eager 模式相比提高了 1.3 倍。

部署

如何在 AWS Graviton 基础上的 Amazon EC2 实例上部署 torch 编译的 RAG，以及如何与 TorchServe 一起使用 Llama 进行部署的详细指南可以在 PyTorch 网站上找到。

结论

在这篇博客中，我们介绍了如何在 AWS Graviton3 基于的 EC2 实例上优化嵌入模型推理性能。我们还分享了面临的挑战、我们实施的解决方案以及为 RAG 用例带来的优化，以及最终的速度提升。希望您会尝试一下！如果您在 Graviton 上需要任何 ML 软件的支持，请在 AWS Graviton 技术指南 GitHub 上提交问题。

我们想对 Eli Uriegas 表示感谢，感谢他在使这篇博客文章成为可能方面提供的支持。

作者

Sunita Nadampalli 是 AWS 的首席工程师和 AI/ML 专家。她负责 AWS Graviton 软件性能优化，针对 AI/ML 和 HPC 工作负载。她对开源软件开发以及为基于 Arm ISA 的 SoC 提供高性能和可持续的软件解决方案充满热情。

安克特·古纳帕尔是 Meta（PyTorch）的 AI 合作伙伴工程师。他负责客户支持、TorchServe 的推广和发布工程。他对解决模型推理和模型服务中的生产问题充满热情。他还喜欢将技术复杂的内容提炼成用户友好的格式。

哈米德·肖贾纳泽里领导 Meta 的 AI 框架合作伙伴工程团队。他对构建可扩展的 AI 解决方案充满热情，并专注于使用 PyTorch 解决大规模分布式训练、推理、模型服务和优化的挑战。

在 AWS Graviton 处理器上使用 torch.compile 提高 RAG 性能

如何优化 RAG？

嵌入式模型

热切模式

torch.compile

权重预打包

torch.inference_mode

CPU 上的端到端 RAG 场景

在端到端 RAG 场景中实现类似增益的挑战和解决方案是什么？

挑战 1：模型处理

解决方案

挑战 2：触发优化

解决方案

挑战 3：额外的编译

解决方案

部署

结论

作者

文档

教程

资源