Harnessing Qualcomm Adreno GPU for Generative AI: Open-Source Approach
The emergence of generative AI represents a significant milestone in technology and innovation. This revolution is defined by machines' ability to create content (such as text, images, music, and even entire virtual worlds) that is nearly indistinguishable from human-made output. Driven by advanced algorithms and large datasets, generative AI has pushed beyond the limits of traditional computing, providing remarkable capabilities in creativity, automation, and problem-solving across multiple fields.
Among the most groundbreaking advancements in generative AI are large language models (LLMs). These models, trained on diverse and extensive datasets, can comprehend and generate human-like text, enabling applications ranging from natural language processing to conversational AI. LLMs such as the GPT series from OpenAI, the Llama models from Meta, and models from many other organizations have revolutionized how we interact with machines, enabling more intuitive and context-aware communication.
Edge devices now carry increasingly capable high-performance cores such as GPUs, multicore CPUs, and NPUs. This growth in on-device compute, combined with rising privacy concerns, creates a need for large language models (and inference workloads in general) to run in real time on edge devices.
Modern edge devices like smartphones, IoT devices, etc., are highly capable of running huge LLMs to perform complex inferences locally without relying on cloud-based services. Integration of LLMs into edge devices opens a plethora of new opportunities to provide highly personalized and context-aware interactions in various sectors such as healthcare, automotive, and consumer electronics. The synergy between advanced AI models and edge computing thus represents a paradigm shift towards more responsive, secure, personalized and intelligent technological ecosystems.
Qualcomm Adreno GPU is a formidable asset for deep learning, offering the robust computational capabilities needed to handle the intensive workloads of AI applications. Adreno GPUs, designed with an advanced graphics and parallel processing architecture, excel at accelerating inference for deep learning models. Adreno GPU already supports running various deep learning models through the OpenCL framework. Additionally, the Snapdragon Neural Processing Engine SDK is a high-performance neural network runtime designed to run deep learning models efficiently on Snapdragon mobile platforms, and it supports an Adreno GPU delegate for running those models.
Qualcomm Technologies' dedication to open-source initiatives is evident in its continuous efforts to enhance the capabilities of its hardware through community-driven projects. By championing open-source collaboration, Qualcomm Technologies not only accelerates innovation but also democratizes access to cutting-edge technology. Adreno GPU already offers robust support for standard OpenCL and Vulkan frameworks in addition to support for proprietary extensions that enable efficient and high-performance execution.
In this article we introduce an open-source solution for running generative AI models on Adreno GPU. We started with MLC-LLM, a universal LLM deployment engine with ML compilation. MLC-LLM supports a wide range of backends such as CUDA, Vulkan, and OpenCL. Through our effort, Qualcomm Technologies has significantly improved the performance of the OpenCL backend for Adreno GPU. The majority of these enhancements have already been contributed back to the community. Thanks to this community collaboration, we can now support a variety of large language models on Adreno GPU.
Why MLC-LLM?
MLC-LLM is a community-driven project with significant contributions and frequent new model additions. MLC-LLM uses TVM (Tensor Virtual Machine) at its core. Qualcomm Technologies has contributed to and maintained the Adreno GPU backend of TVM for years. Through TVM, we offer an open-source solution for vision (or non-generative AI) models with strong performance. Now, we are proud to reiterate Qualcomm Technologies' commitment to open-source solutions for generative AI models on Adreno GPU through MLC/TVM.
What we offer
- Enhanced Performance: The optimizations we added to the Adreno GPU target in the MLC-LLM project significantly improve LLM performance across all supported models. The improvement is not limited to decode (tokens per second); it also covers prompt processing (time to first token).
- Platform Coverage: The MLC-LLM solution works on Android (or Linux) as well as Windows platforms, wherever OpenCL 3.0 is supported. Please refer to the documentation for details on building and running for each platform. Typical Windows platforms are Windows 11 devices with Snapdragon X Elite and Snapdragon X Plus processors; typical Android platforms are smartphones powered by Snapdragon 8 Gen 1/2/3 and the latest Snapdragon 8 Elite.
- Documentation: Documentation is available to guide developers through building large language models for supported platforms, as well as building target binaries such as the MLC/TVM runtime and other utilities.
- Open-Source: All the optimizations, additional tools, and documentation are available through the MLC public repository as well as public repositories hosted and tested by Qualcomm Technologies.
Getting Started:
The MLC-LLM solution for Adreno GPU is accessible from the mainline project located at mlc-llm. A fork of the same, tested on Adreno GPU and carrying specialized documentation and additional tools, can be accessed at mlc-llm with Qualcomm Technologies additions.
As a value add, we also offer tested prebuilt artifacts of MLC and TVM for new model compilation, along with target utilities, at AdrenoGPU Release Artifacts.
Documentation with all instructions is available for the community at Adreno Documentation.
Large language model prototyping on Adreno GPU is available for Android, Linux, and Windows platforms. In general, model compilation for Linux and Android platforms happens on a Linux host, while Windows platforms need a Windows host.
The instructions are almost identical for Linux and Windows platforms. The instructions below help you get started quickly, covering host setup, SDK installation, model compilation, and running the model on the target platform.
Linux:
Host Preparation:
conda create -n mlc-venv -c conda-forge "llvmdev=15" "cmake>=3.24" git rust numpy decorator psutil typing_extensions scipy attrs git-lfs gcc=10.4 gxx=10.4 python=3.8
conda activate mlc-venv
pip install torch==2.2.0 torchvision==0.18.0 torchaudio==2.3.0
SDK Installation:
The SDK is available as python packages at Adreno GPU Release Artifacts.
pip install tvm_adreno_mlc_cpu-0.19.dev0-cp38-cp38-manylinux_2_28_x86_64.whl
pip install mlc_llm_adreno_cpu-0.1.dev0-cp38-cp38-manylinux_2_28_x86_64.whl
# Check installation status
python -c "import tvm; print(tvm.__path__)"
python -c "import mlc_llm; print(mlc_llm.__path__)"
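The two one-line checks above can also be combined into a small script that reports every missing package at once. This is a minimal sketch using only the Python standard library; the helper name find_missing is illustrative and is not part of TVM or MLC-LLM.

```python
import importlib.util


def find_missing(module_names):
    """Return the names from module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(["tvm", "mlc_llm"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("tvm and mlc_llm are both importable")
```

Run it inside the activated mlc-venv environment so that it inspects the same interpreter the wheels were installed into.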
You may also download the utils archive and extract it; it has the layout below:
mlc_llm-utils-linux-arm64
├── bin
│   └── mlc_cli_chat
└── lib
    ├── libmlc_llm_module.so
    ├── libmlc_llm.so
    └── libtvm_runtime.so
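Before copying the utilities to a device, it can help to confirm that the extracted archive matches the listing above. The following is a minimal sketch, assuming the archive was extracted into the current directory; the helper missing_files and the constant EXPECTED_LINUX_FILES are illustrative names, not part of the release.

```python
from pathlib import Path

# Relative paths taken from the archive listing above.
EXPECTED_LINUX_FILES = [
    "bin/mlc_cli_chat",
    "lib/libmlc_llm_module.so",
    "lib/libmlc_llm.so",
    "lib/libtvm_runtime.so",
]


def missing_files(root, expected=EXPECTED_LINUX_FILES):
    """Return the expected relative paths that are absent under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumes the archive was extracted into the current directory.
    missing = missing_files("mlc_llm-utils-linux-arm64")
    print(missing if missing else "All expected files are present")
```

The same function works for the Windows utilities if a list of the .exe and .dll names is passed as the expected argument.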
Windows:
Host Preparation:
conda create -n mlc-venv -c conda-forge "llvmdev=15" "cmake>=3.24" git rust numpy==1.26.4 decorator psutil typing_extensions scipy attrs git-lfs python=3.12 onnx clang_win-64
conda activate mlc-venv
pip install torch==2.2.0 torchvision==0.18.0 torchaudio==2.3.0
SDK Installation:
pip install tvm_adreno_cpu-0.19.dev0-cp312-cp312-win_amd64.whl
pip install mlc_llm_adreno_cpu-0.1.dev0-cp312-cp312-win_amd64.whl
# Check installation status
python -c "import tvm; print(tvm.__path__)"
python -c "import mlc_llm; print(mlc_llm.__path__)"
You may also download the utils archive and extract it; it has the layout below:
mlc_llm-utils-win-x86
└── bin
    ├── mlc_cli_chat.exe
    ├── mlc_llm.dll
    ├── mlc_llm_module.dll
    └── tvm_runtime.dll
Sample Model:
Let's consider a sample model to demonstrate the entire flow: model compilation, target setup, deployment on the target, and running the model. Large language models can be downloaded from Hugging Face (https://huggingface.co) or other sources.
Model compilation happens entirely through the Python utility mlc_llm. The process is the same on Windows and Linux, with minor changes.
The walkthrough below assumes the Llama-2-7b-chat-hf model, located under the folder ./dist/models/Llama-2-7b-chat-hf.
The compilation process has the stages described below.
Model Compilation:
Compilation of a large language model consists of configuration generation, parameter quantization, and module build, as shown below.
Configuration Generation
python -m mlc_llm gen_config \
<SOURCE_MODEL> \
--quantization q4f16_0 \
--conv-template <MODEL_TEMPLATE> \
--prefill-chunk-size 256 \
<ADDITIONAL_OPTIONS> \
-o <MODEL_OUTPUT_PATH>
Parameter Quantization
python -m mlc_llm convert_weight \
<SOURCE_MODEL> \
--quantization q4f16_0 \
-o <MODEL_OUTPUT_PATH>
Module Build
python -m mlc_llm compile \
<MODEL_OUTPUT_PATH>/mlc-chat-config.json \
--device <DEVICE_CONFIG> \
-o <MODEL_LIB>
The device configuration (DEVICE_CONFIG) is android:adreno-so for Android or Linux platforms and windows:adreno_x86 for Windows platforms.
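The three stages above can be scripted so that the same paths and quantization setting are reused consistently. The following is a minimal sketch that only assembles the command lines, using the subcommands and flags shown above; the function name build_compile_commands and its parameter names are illustrative assumptions, not part of MLC-LLM.

```python
import shlex


def build_compile_commands(source_model, output_dir, model_lib, device,
                           conv_template, quantization="q4f16_0",
                           prefill_chunk_size=256):
    """Return the three mlc_llm invocations as argument lists, in run order."""
    base = ["python", "-m", "mlc_llm"]
    gen_config = base + [
        "gen_config", source_model,
        "--quantization", quantization,
        "--conv-template", conv_template,
        "--prefill-chunk-size", str(prefill_chunk_size),
        "-o", output_dir,
    ]
    convert_weight = base + [
        "convert_weight", source_model,
        "--quantization", quantization,
        "-o", output_dir,
    ]
    compile_lib = base + [
        "compile", f"{output_dir}/mlc-chat-config.json",
        "--device", device,
        "-o", model_lib,
    ]
    return [gen_config, convert_weight, compile_lib]


if __name__ == "__main__":
    commands = build_compile_commands(
        source_model="./dist/models/Llama-2-7b-chat-hf",
        output_dir="./dist/Llama-2-7b-chat-hf-q4f16_0-MLC",
        model_lib="./dist/libs/Llama-2-7b-chat-hf-q4f16_0-adreno.so",
        device="android:adreno-so",
        conv_template="llama-2",
    )
    for cmd in commands:
        # Print only; pass each list to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell quoting issues, and switching targets only requires changing the device and model_lib arguments.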
For example, the Llama-2-7b-chat-hf model can be compiled for Adreno GPU targets (Android targets) on Linux as follows:
# Generate Config
python -m mlc_llm gen_config \
./dist/models/Llama-2-7b-chat-hf \
--quantization q4f16_0 \
--conv-template llama-2 \
--prefill-chunk-size 256 \
-o ./dist/Llama-2-7b-chat-hf-q4f16_0-MLC
# Quantize Parameters
python3 -m mlc_llm convert_weight \
./dist/models/Llama-2-7b-chat-hf \
--quantization q4f16_0 \
-o ./dist/Llama-2-7b-chat-hf-q4f16_0-MLC
# Compile model for Linux / Android Adreno GPU target
python3 -m mlc_llm compile \
./dist/Llama-2-7b-chat-hf-q4f16_0-MLC/mlc-chat-config.json \
--device android:adreno-so \
-o ./dist/libs/Llama-2-7b-chat-hf-q4f16_0-adreno.so
The artifacts we need to pick up here are the quantized weights located at ./dist/Llama-2-7b-chat-hf-q4f16_0-MLC and the model library located at ./dist/libs/Llama-2-7b-chat-hf-q4f16_0-adreno.so.
The same model can be compiled in a Windows environment, targeting Adreno GPU on Windows, as follows:
# Generate Config
python -m mlc_llm gen_config \
./dist/models/Llama-2-7b-chat-hf \
--quantization q4f16_0 \
--conv-template llama-2 \
--prefill-chunk-size 256 \
-o ./dist/Llama-2-7b-chat-hf-q4f16_0-MLC
# Quantize Parameters
python3 -m mlc_llm convert_weight \
./dist/models/Llama-2-7b-chat-hf \
--quantization q4f16_0 \
-o ./dist/Llama-2-7b-chat-hf-q4f16_0-MLC
# Compile model for Windows Adreno GPU Target
python3 -m mlc_llm compile \
./dist/Llama-2-7b-chat-hf-q4f16_0-MLC/mlc-chat-config.json \
--device windows:adreno_x86 \
-o ./dist/libs/Llama-2-7b-chat-hf-q4f16_0-adreno.dll
Similarly, the Windows artifacts we need to pick up are the quantized weights located at ./dist/Llama-2-7b-chat-hf-q4f16_0-MLC and the model library located at ./dist/libs/Llama-2-7b-chat-hf-q4f16_0-adreno.dll.
Technically, the quantized weights are the same for any Adreno GPU target, because the same quantization is used across targets.
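To get a feel for how large the quantized weights are, a rough estimate can be computed from the parameter count. In MLC-LLM quantization names, q4f16 denotes 4-bit weights with 16-bit floating-point activations. The sketch below assumes group quantization with one 16-bit scale per group of 32 weights; the group size, the nominal 7e9 parameter count, and the function name are assumptions for illustration, and the result ignores any layers that are kept unquantized.

```python
def estimate_weight_bytes(num_params, weight_bits=4, group_size=32, scale_bits=16):
    """Approximate storage for group-quantized weights plus per-group scales."""
    packed_weights = num_params * weight_bits / 8
    scales = (num_params / group_size) * scale_bits / 8
    return packed_weights + scales


if __name__ == "__main__":
    nominal_params = 7e9  # nominal count for a "7B" model, not an exact figure
    gib = estimate_weight_bytes(nominal_params) / 2**30
    print(f"Approximate quantized weight size: {gib:.2f} GiB")
```

An estimate like this is a quick way to judge whether a model's weights are likely to fit in the memory available on a given device before compiling it.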
Deploy and Run on Adreno GPU:
Running the model is very similar on Linux (or Android) and Windows targets when using the native CLI. Windows additionally supports running through the Python CLI.
Linux or Android
We use the precompiled target CLI utilities in mlc_llm-utils-linux-arm64.tar.bz2, which contain the CLI tool mlc_cli_chat and its dependencies.
mlc_llm-utils-linux-arm64
├── bin
│   └── mlc_cli_chat
└── lib
    ├── libmlc_llm_module.so
    ├── libmlc_llm.so
    └── libtvm_runtime.so
Push these contents to the Adreno GPU target running the Linux or Android operating system.
Also push the build artifacts dist/Llama-2-7b-chat-hf-q4f16_0-MLC and dist/libs/Llama-2-7b-chat-hf-q4f16_0-adreno.so from the host to the target.
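This article does not prescribe a transfer tool. For Android targets, one common option is adb push. The sketch below only assembles the push commands for the utilities and build artifacts named above; the use of adb, the destination directory /data/local/tmp/mlc, and the helper name build_push_commands are assumptions for illustration.

```python
import shlex


def build_push_commands(local_paths, remote_dir):
    """Return one 'adb push <local> <remote_dir>' argument list per local path."""
    return [["adb", "push", path, remote_dir] for path in local_paths]


if __name__ == "__main__":
    artifacts = [
        "mlc_llm-utils-linux-arm64",
        "dist/Llama-2-7b-chat-hf-q4f16_0-MLC",
        "dist/libs/Llama-2-7b-chat-hf-q4f16_0-adreno.so",
    ]
    # /data/local/tmp/mlc is a hypothetical destination; pick any writable path.
    for cmd in build_push_commands(artifacts, "/data/local/tmp/mlc"):
        print(shlex.join(cmd))
```

For a Linux target reachable over the network, a tool such as scp would play the same role.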
Now, the command below launches the chat on the CLI:
LD_LIBRARY_PATH=./libs ./mlc_cli_chat --model <PATH to Llama-2-7b-chat-hf-q4f16_0-MLC> --model-lib <PATH to Llama-2-7b-chat-hf-q4f16_0-adreno.so> --device opencl
Windows
Windows supports running the compiled model through Python as well as through the native CLI.
Native CLI
We use the precompiled target CLI utilities in mlc_llm-utils-win-x86.tar.bz2, which contain the CLI tool mlc_cli_chat.exe and its dependencies, as listed below.
mlc_llm-utils-win-x86
└── bin
    ├── mlc_cli_chat.exe
    ├── mlc_llm.dll
    ├── mlc_llm_module.dll
    └── tvm_runtime.dll
Now, copy the build artifacts dist/Llama-2-7b-chat-hf-q4f16_0-MLC and dist/libs/Llama-2-7b-chat-hf-q4f16_0-adreno.dll from the host to the target.
On the target, the command below launches an interactive chat:
mlc_cli_chat.exe --model <PATH to Llama-2-7b-chat-hf-q4f16_0-MLC> --model-lib <PATH to Llama-2-7b-chat-hf-q4f16_0-adreno.dll> --device opencl
Python CLI
Prepare the target device, powered by a Snapdragon X Elite processor, in the same way as the Windows host, as described below.
Install Anaconda from https://docs.anaconda.com/anaconda/install/windows/
Create an anaconda environment with below configuration:
conda create -n mlc-venv -c conda-forge "llvmdev=15" "cmake>=3.24" git rust numpy decorator psutil typing_extensions scipy attrs git-lfs python=3.12 onnx clang_win-64
conda activate mlc-venv
Download MLC-LLM (Windows) package from Releases
Now, install the package as shown below:
pip install tvm_adreno_cpu-0.19.dev0-cp312-cp312-win_amd64.whl
pip install mlc_llm_adreno_cpu-0.1.dev0-cp312-cp312-win_amd64.whl
Check the installation status as:
python -c "import tvm; print(tvm.__path__)"
python -c "import mlc_llm; print(mlc_llm.__path__)"
Now, copy the build artifacts dist/Llama-2-7b-chat-hf-q4f16_0-MLC and dist/libs/Llama-2-7b-chat-hf-q4f16_0-adreno.dll from the host to the Adreno Windows device.
In the Anaconda shell with the mlc-venv environment active, execute the command below:
python -m mlc_llm chat --device opencl --model-lib ./dist/libs/Llama-2-7b-chat-hf-q4f16_0-adreno.dll ./dist/Llama-2-7b-chat-hf-q4f16_0-MLC/
Model Coverage
Columns are grouped by platform: Snapdragon 8 Gen 3, Snapdragon X Elite (Windows), and Snapdragon 8 Elite. Decode throughput is reported for 100 tokens; encode throughput and time to first token (TTFT) are reported for a 256-token prompt.

| Model | 8 Gen 3 decode (toks/sec) | 8 Gen 3 encode (toks/sec) | 8 Gen 3 TTFT (s) | X Elite decode (toks/sec) | X Elite encode (toks/sec) | X Elite TTFT (s) | 8 Elite decode (toks/sec) | 8 Elite encode (toks/sec) | 8 Elite TTFT (s) |
|---|---|---|---|---|---|---|---|---|---|
| Llama-2-7b-chat-hf | 14 | 88 | 2.89 | 20 | 104 | 2.45 | 14.34 | 92.5 | 2.7 |
| Meta-Llama-3-8B-Instruct | 11.1 | 80.5 | 3.16 | 17 | 95 | 2.7 | 12.5 | 84.6 | 3.01 |
| gemma-2b-it | 14.77 | 244.7 | 1.02 | 41 | 330 | 0.77 | 33.25 | 293.5 | 0.85 |
| Mistral-7B-Instruct-v0.2 | 10.85 | 64.1 | 3.9 | 16.3 | 75 | 3.3 | 11.5 | 71 | 3.5 |
| phi-2 | 28.9 | 137.7 | 1.98 | 42.8 | 203.5 | 1.25 | 30 | 145.4 | 1.86 |
| Phi-3-mini-4k-instruct | 22.6 | 145.5 | 1.75 | 33.5 | 169.3 | 1.5 | 24.88 | 156.358 | 1.64 |
| Qwen-7B-Chat | 12.3 | 72.3 | 3.6 | 15.7 | 101.7 | 2.5 | 11.9 | 72.5 | 3.59 |
| llava-1.5-7b-hf | 13.3 | 85.8 | 2.9 | 21 | 119 | 2.15 | 13.1 | 85.6 | 2.9 |
Because the optimizations are generic, every model officially supported by the MLC-LLM project should ideally work on Adreno GPU targets as well. The table above lists well-known large language models we have benchmarked on Adreno GPU across a few platforms.
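The reported TTFT figures are roughly consistent with dividing the 256-token prompt by the encode throughput: for Llama-2-7b-chat-hf on Snapdragon 8 Gen 3, 256 / 88 is about 2.9 seconds, against a reported 2.89. The sketch below applies that relation and adds a simple estimate of total response time. Both function names are illustrative, and the additive latency model is a simplifying assumption rather than a measured result.

```python
def time_to_first_token(prompt_tokens, encode_tokens_per_sec):
    """Seconds spent processing the prompt before the first output token."""
    return prompt_tokens / encode_tokens_per_sec


def estimated_response_time(prompt_tokens, encode_tokens_per_sec,
                            output_tokens, decode_tokens_per_sec):
    """Rough end-to-end latency: prompt processing plus token-by-token decode."""
    ttft = time_to_first_token(prompt_tokens, encode_tokens_per_sec)
    return ttft + output_tokens / decode_tokens_per_sec


if __name__ == "__main__":
    # Llama-2-7b-chat-hf on Snapdragon 8 Gen 3, figures taken from the table.
    ttft = time_to_first_token(256, 88)
    total = estimated_response_time(256, 88, 100, 14)
    print(f"TTFT ~ {ttft:.2f} s; 256-token prompt plus 100 output tokens ~ {total:.2f} s")
```

The same two functions can be applied to any row of the table to compare platforms for a given prompt and response length.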
The community MLC project supports a range of models, listed below, though we have not benchmarked or evaluated the performance of all of them.
Llama from Meta: llama2_7b, llama2_13b, llama2_70b, llama3_1_8b, llama3_1_70b, codellama_7b, codellama_13b, codellama_34b, tinyllama_1b_chat_v0.4, tinyllama_1b_chat_v1.0
Mistral AI: mistral_7b, mistral_7b_v03, Mixtral-8x7B-v0.1
OpenAI: gpt2, gpt2_medium, gpt_bigcode
Together AI: redpajama_3b_v1
Phi from Microsoft: phi-1_5, phi-2, phi-3_5, phi-3_5-vision
Qwen: qwen, qwen2, qwen2moe, qwen2_0_5b, qwen2_1_5b, qwen2.5_3b, qwen2_7b
Stability AI: stablelm
Baichuan AI: baichuan
InternLM: internlm, internlm2, internlm2_5_7b
Google: gemma2_2b, gemma2_9b, gemma2_27b
SmolLM: smollm_1_7b, smollm_360m, smollm_135m
MiniCPM: minicpm_2b, minicpm_2b_sft_bf16, minicpm-moe-8x2b
Others: rwkv5_3b, orion, llava, chatglm, snowflake-arctic-embed-m, stablelm-2-zephyr-1_6b, starcoder2, aya-23
Future Work
Qualcomm Technologies is steadfast in its commitment to continuous investment in enhancing open-source solutions for Adreno GPUs. By actively contributing to and optimizing projects such as MLC-LLM, Qualcomm Technologies aims to empower developers with the tools and resources needed to fully leverage the capabilities of Adreno GPUs.
In the future, we will continue to expand model coverage and add optimizations, driven by our deep understanding of Adreno GPU hardware, across both performance and power.
Conclusion
We encourage community developers to try out these optimizations and share their feedback on our Developer Discord server. Your experiences and insights are invaluable in helping us refine and improve the capabilities of Adreno GPUs.
Stay tuned for more updates!


