Qualcomm Cloud AI Introduces Efficient Transformers: One API, Infinite Possibilities
With a developer-centric approach, we unveil our latest library: the efficient transformers library, designed to ease the deployment of large language models (LLMs) on Qualcomm Cloud AI 100. This library empowers users to seamlessly port pretrained models and checkpoints on the HuggingFace hub (developed using HuggingFace transformers library) into inference-ready formats that run efficiently on Qualcomm Cloud AI 100 accelerators.
In this blog post, we go through the journey of taking a supported transformer-based model, trained on any GPU (Graphics Processing Unit) or AI accelerator, and deploying it on Qualcomm Cloud AI 100 instances.
The efficient transformers library includes reimplementations of foundational model blocks, optimized for Qualcomm Cloud AI 100. When the library is invoked to port a pretrained model, the blocks of the original model are replaced with the reimplemented blocks from the library at the PyTorch level. Users therefore retain the ability to finetune, quantize, or otherwise adapt the model, and still have modeling code optimized for Qualcomm Cloud AI 100.
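The block replacement described here can be pictured with a minimal sketch that uses plain Python classes as stand-ins for PyTorch modules. Every name in it (`Module`, `Attention`, `OptimizedAttention`, `swap_blocks`) is hypothetical and is not the library's API; the only point is that the model keeps its structure and weights while selected blocks change implementation.

```python
class Module:
    """Toy stand-in for a PyTorch module: holds named child modules."""

    def __init__(self, **children):
        self.children = dict(children)


class Attention(Module):
    """Original attention block, as shipped with the pretrained model."""

    def __init__(self, weights):
        super().__init__()
        self.weights = weights


class OptimizedAttention(Attention):
    """Hypothetical reimplementation that reuses the original weights."""


# Hypothetical mapping: original block class -> optimized replacement.
REPLACEMENTS = {Attention: OptimizedAttention}


def swap_blocks(module, replacements=REPLACEMENTS):
    """Recursively replace matching child blocks in place; return swap count."""
    swapped = 0
    for child in module.children.values():
        new_cls = replacements.get(type(child))
        if new_cls is not None:
            # Re-class the instance so its weights and children carry over.
            child.__class__ = new_cls
            swapped += 1
        swapped += swap_blocks(child, replacements)
    return swapped


model = Module(
    layer0=Module(attn=Attention(weights=[0.1, 0.2])),
    layer1=Module(attn=Attention(weights=[0.3, 0.4])),
)
num_swapped = swap_blocks(model)
```

Because the swap happens on the model object rather than on an exported graph, anything done to the model beforehand (finetuning, quantization) is still reflected in the weights that the replaced blocks use.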
The efficient transformers library ports HuggingFace pretrained models based on their "class", which encapsulates the implementation style of the model's architectural components. Consequently, all HuggingFace models that are direct derivatives of the same foundational class are ported transparently. For example, support for MistralForCausalLM in efficient transformers implies that all models based on MistralForCausalLM can be ported to Qualcomm Cloud AI 100 without any other model-specific changes.
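As a rough illustration of class-based support, the sketch below checks the `architectures` field that HuggingFace model configs typically carry against a set of supported class names. The set contents, the model card names, and the `is_supported` helper are assumptions made for this example, not the library's actual registry or API.

```python
# Architecture class names assumed supported for this sketch; the real
# list lives in the library and may differ.
SUPPORTED_ARCHITECTURES = {
    "LlamaForCausalLM",
    "MistralForCausalLM",
    "MixtralForCausalLM",
}


def is_supported(config):
    """Return True if any declared architecture class is supported."""
    declared = config.get("architectures", [])
    return any(arch in SUPPORTED_ARCHITECTURES for arch in declared)


# Hypothetical model cards, each with the `architectures` field of its config.
configs = {
    "example-org/mistral-base": {"architectures": ["MistralForCausalLM"]},
    "example-org/mistral-chat-finetune": {"architectures": ["MistralForCausalLM"]},
    "example-org/some-encoder": {"architectures": ["BertModel"]},
}

portable = sorted(name for name, cfg in configs.items() if is_supported(cfg))
```

Both Mistral-based entries pass the check through the same class name, which is why a finetuned derivative needs no model-specific handling.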
The current version of the library supports CausalLM models available on HuggingFace that are based on popular architectures such as Llama, Mistral, Mixtral, CodeGen, StarCoder, and MPT, with support for more foundation model classes coming soon.
Deploying on Qualcomm Cloud AI 100
Optimize -> Export -> Compile
The Qualcomm Cloud AI 100 accelerator is a tailor-made, highly sophisticated inference product, and its software solution has been built for quickly deploying trained models.
In three simple steps, Optimize -> Export -> Compile, a pre-trained model in any framework can be made inference friendly and deployed on Qualcomm Cloud AI 100. The Qualcomm Cloud AI 100 tools hide the complexities and intricacies involved and give out-of-the-box performance in most use cases.
While exporting, compiling, and deploying models are the fundamental steps, the efficient transformers library folds them into a single step, handling the conversion and optimization intricacies behind the scenes.
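The idea of collapsing the stages into one call can be sketched as follows. The stage functions and `deploy_in_one_step` are hypothetical placeholders that only record which stage ran; they do not perform real optimization, export, or compilation and do not reflect the library's function names.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Tracks a model name and the pipeline stages applied to it."""

    name: str
    stages: list = field(default_factory=list)


def optimize(artifact):
    artifact.stages.append("optimize")  # placeholder for model-level rewrites
    return artifact


def export(artifact):
    artifact.stages.append("export")  # placeholder for graph export
    return artifact


def compile_model(artifact):
    artifact.stages.append("compile")  # placeholder for device compilation
    return artifact


def deploy_in_one_step(model_name):
    """Run Optimize -> Export -> Compile behind a single entry point."""
    artifact = Artifact(model_name)
    for stage in (optimize, export, compile_model):
        artifact = stage(artifact)
    return artifact


result = deploy_in_one_step("example-org/example-model")
```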
Furthermore, the library adds a layer of intelligence to the deployment pipeline, automatically identifying and implementing model-specific optimizations tailored to each pretrained model. The library streamlines this process, guaranteeing best performance with minimal effort.
Deploying models
Once the compiled network satisfies the KPI (Key Performance Indicator) for the given use case, it can be deployed with the help of inference servers like Triton and orchestrated through Kubernetes. Virtualization and Docker support are available as part of the Platform SDK (Software Development Kit) shown in Figure 4. The C++ and Python runtime APIs shipped with the Platform SDK can be used to create end-to-end applications for inference deployment. Several examples in the online user guide also explain how models can be deployed for inference on Qualcomm Cloud AI 100.
LLMs on Qualcomm Cloud AI 100
Efficient deployment of LLMs on inference accelerators needs to address multiple challenges without sacrificing performance. For example:
- Instead of recalculating KV values, maintain a growing cache of KV values (KV$) on the accelerator between LLM decoding steps
- Though the above would naturally lead to variable-sized tensors, generate only fixed-shape tensors to enable ahead-of-time (AOT) AI 100 compiler to generate performant code
- Auto-detect and efficiently handle various KV$ layouts of models, like
  - [batch, context_length, num_heads, head_dim]
  - [batch, num_heads, context_length, head_dim]
  - [context_length, batch, num_heads, head_dim] and others
- Efficiently manage the memory to write-back computed KV.
- Handle IEEE-fp16 range violations in the pretrained models (usually in fp32).
- Handle input management for rolling buffers in KV$, etc.
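Two of the items above, the fixed-shape cache and the layout handling, can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `FixedShapeKVCache`, `permutation_to_canonical`, and the choice of canonical layout are all hypothetical. The cache is preallocated to the full context length and filled by position, so its shape never changes between decoding steps.

```python
class FixedShapeKVCache:
    """KV cache preallocated to ctx_len slots so its shape stays constant."""

    def __init__(self, ctx_len, head_dim):
        self.ctx_len = ctx_len
        self.keys = [[0.0] * head_dim for _ in range(ctx_len)]
        self.values = [[0.0] * head_dim for _ in range(ctx_len)]
        self.position = 0  # number of slots holding real data

    def write(self, key, value):
        """Store one decoding step's K/V at the next free position."""
        if self.position >= self.ctx_len:
            raise IndexError("context length exceeded")
        self.keys[self.position] = list(key)
        self.values[self.position] = list(value)
        self.position += 1

    def valid_mask(self):
        """Mark which slots hold real data; the rest are padding."""
        return [i < self.position for i in range(self.ctx_len)]

    def shape(self):
        return (len(self.keys), len(self.keys[0]))


# Canonical layout chosen arbitrarily for this sketch.
CANONICAL = ("batch", "num_heads", "context_length", "head_dim")


def permutation_to_canonical(layout):
    """Axis order that, passed to a transpose, maps `layout` to CANONICAL."""
    return tuple(layout.index(axis) for axis in CANONICAL)


cache = FixedShapeKVCache(ctx_len=4, head_dim=2)
cache.write([1.0, 2.0], [3.0, 4.0])
cache.write([5.0, 6.0], [7.0, 8.0])

perm = permutation_to_canonical(
    ("context_length", "batch", "num_heads", "head_dim")
)
```

The validity mask is what lets attention ignore the padded slots, so a fixed-shape tensor can stand in for the logically growing cache.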
All these challenges are abstracted away from the user – just the way it should be.
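One of the abstracted challenges, the fp16 range item in the list above, can be illustrated with a small sketch. The largest finite IEEE half-precision value is 65504, so fp32 constants of larger magnitude (for example, very negative fill values) overflow when a model is run in fp16. The helper names below are hypothetical, and clamping is only one possible remedy, not necessarily the one the library applies.

```python
FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def count_violations(values):
    """Count values whose magnitude cannot be represented as finite fp16."""
    return sum(1 for v in values if abs(v) > FP16_MAX)


def clamp_to_fp16_range(values):
    """Clamp each value into [-FP16_MAX, FP16_MAX]."""
    return [max(-FP16_MAX, min(FP16_MAX, v)) for v in values]


# Example fp32 values, including a very negative fill constant.
fp32_values = [1.0, -3.4e38, 70000.0, 65504.0]

violations = count_violations(fp32_values)
clamped = clamp_to_fp16_range(fp32_values)
```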
The library provides a simple API for compiling and running models on Qualcomm Cloud AI 100.
Launching Transformer-based LLMs on Qualcomm Cloud AI 100
The illustration in this section shows the efficient transformers library in action, for different model types and sizes.
Codegen-2B
Mistral-7B (Single SOC)
CodeLLAMA-34B
Tensor parallelism across 4 SOCs
Multiple LLMs? Why not!
Provide model card names in a JSON, enable models en masse!
There are thousands of derivatives of Llama and Mistral (and other foundational models) on HuggingFace, finetuned for various use cases and datasets.
The illustration below shows how efficient transformers APIs can be used to run multiple models (out of the supported model architectures) in one shot.
The user provides the names of the models (as specified in their respective model cards) as input to the library's helper application (a simple application that invokes the .infer() API in a loop), and the library generates optimized inference containers for all of those models.
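A helper loop of this kind might look like the sketch below. The JSON schema (a top-level `models` list), the model card names, and `run_models` are assumptions for illustration, and `fake_infer` is a stand-in for the library's `.infer()` call so that the example is self-contained.

```python
import json


def run_models(model_list_json, infer):
    """Call `infer` once per model card name; collect results and failures."""
    names = json.loads(model_list_json)["models"]
    results, failures = {}, {}
    for name in names:
        try:
            results[name] = infer(name)
        except Exception as err:
            # Keep going so one unsupported model does not stop the batch.
            failures[name] = str(err)
    return results, failures


def fake_infer(model_name):
    """Stand-in for the real inference call; rejects one hypothetical name."""
    if "unsupported" in model_name:
        raise ValueError("architecture not supported")
    return f"compiled:{model_name}"


# Hypothetical input file contents.
model_list = json.dumps(
    {
        "models": [
            "example-org/llama-finetune-a",
            "example-org/mistral-finetune-b",
            "example-org/unsupported-model",
        ]
    }
)

results, failures = run_models(model_list, fake_infer)
```

Collecting failures separately, rather than raising on the first one, is a design choice of this sketch that suits bulk runs over many model cards.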
As shown above, the library makes it seamless for developers to run their workloads on Qualcomm Cloud AI 100 cards using simple APIs.
For developers who want finer control over workloads, the library also provides low-level developer APIs.
To take it a step further, the library will also be integrated into the Qualcomm Cloud AI 100 Apps SDK installation. That gives users a true one-step path from model source to inference output, and significantly reduces the number of steps needed to run a model.
The library is the key enabler between the Product Stack and the MLOps/LLMOps open source offerings. The efficient transformers library will expose interfaces that will be integrated with Triton, LLM serving stacks, K8s deployments, etc. It is also extensible, allowing model-specific optimizations without any change to the interfaces or the usage APIs.
Design Ethos
It’s clear that deploying pre-trained LLMs on a specialized inference accelerator entails some changes to the models so that they run optimally on the target hardware. Most of the changes boil down to pattern recognition and sub-graph substitution.
We found that protobuf-level graph transformations are not scalable and are highly brittle, as the patterns depend heavily on the coding style of the model implementors. To mitigate this, Qualcomm Technologies’ efficient transformers library APIs handle graph transformations and optimizations at the model source itself. The library detects the transformations that the model needs, depending on the model class and the task. Based on this, a series of module substitutions is performed, in which original model modules are replaced with Qualcomm Technologies’ re-implemented modules.
The key prerequisite is that the models are implemented with the HuggingFace Transformers library.
Conclusion
As the Deep Learning space continues to expand at a breakneck pace, it’s necessary to continuously evolve the hardware, software, and user experience. For any software toolchain to be widely accepted, simplicity of use is paramount, along with the ability to compile once and deploy on multiple platforms. The simple Training-to-Inference workflow will not only make the life of a developer easier, but also significantly reduce the time and cost of deploying LLMs across different verticals and simplify the process of meeting the required KPIs.
Qualcomm Technologies’ efficient transformers library offers a streamlined approach to AI deployment, seamlessly integrating into existing frameworks. By handling the intricacies of model-specific optimizations, it simplifies the transition from pretrained models to inference-ready solutions with just a single API call.
With this, developers can focus on the core aspects of their projects, confident that inference is handled efficiently and effectively. Whether you are a seasoned professional or new to AI deployment, the library makes the journey smoother and more accessible.
Stay tuned for future blogs that will have detailed information on the Qualcomm Cloud AI 100 SW solution.
