Introducing the new OpenCL™ GPU backend in llama.cpp for Qualcomm Adreno GPUs

Written by

Hongqiang Wang

Feb 17, 2025

Co-authored with Li He, Alex Angus, Skyler Szot, Shangqing Gu, Shaofei Qi, and Alex Bourd.

Originally published in December 2024, updated in February 2025 with extra information, including new models supported.

We are thrilled to announce the availability of a new OpenCL backend for llama.cpp, the well-recognized open-source project focused on large language model (LLM) inference. This backend, optimized for Qualcomm Adreno GPUs, supports a range of popular LLMs, including the latest DeepSeek distilled R1 model. This achievement represents a significant step forward in enhancing the performance and flexibility of the llama.cpp project for LLM inference within the AI community. The Adreno OpenCL backend for llama.cpp is now officially upstreamed to the open-source community via Codelinaro.

With this update, developers now have two options for running LLM inference workloads on Qualcomm Adreno GPUs: the open-source Machine Learning Compiler (MLC) project, which our team has been actively developing, and the llama.cpp project. For more details on running LLMs with MLC, please refer to this blog: Harnessing Qualcomm Adreno GPU for Generative AI: Open Source approach

The new OpenCL backend has been integrated into the llama.cpp mainline after its initial availability via Codelinaro. It is primarily based on the OpenCL 3.0 standard, with optional features like subgroups to achieve optimal performance. The backend has been well tested and optimized for premium Adreno GPUs and can be easily ported to other vendors' GPUs that support the OpenCL 3.0 standard.

Benefits of leveraging OpenCL for Adreno

OpenCL (Open Computing Language), developed by the Khronos Group, is a widely adopted industry standard that allows developers to write efficient and portable parallel programming code that runs on a wide range of devices, including CPUs, GPUs, NPUs, FPGAs, and more, without needing in-depth knowledge of these devices. OpenCL on GPUs, in particular, has empowered developers to harness the immense parallel computing power of modern GPUs for general-purpose GPU (GPGPU) applications, such as image/video/vision signal processing and AI workloads like convolutional neural networks (CNNs) and large language models (LLMs).

As a key member of the OpenCL working group within the Khronos Group, Qualcomm Technologies, Inc. has been actively involved in the standardization of OpenCL.

Being one of the earliest adopters of the OpenCL standard on mobile GPUs, Qualcomm has supported OpenCL across a wide range of SoC devices, including high-end, mid-range, and low-end Android smartphones, IoT devices (like drones), automotive platforms, and Windows on Snapdragon (WoS) devices.

Qualcomm Technologies, Inc. has also provided a comprehensive set of tools (Snapdragon Profiler), OpenCL SDK examples, and an OpenCL programming guide with best practices to help developers get started with OpenCL on Adreno GPUs.

OpenCL on GPUs opens new avenues for developers to leverage the computational power of Adreno GPUs in Snapdragon devices. Offloading computationally intensive workloads like llama.cpp inference to the GPU frees up the CPU for other operations. Thanks to OpenCL's openness and portability, the time-to-market for solutions can be significantly shortened, making the return on investment (RoI) highly favorable.

Key features and benefits of using the OpenCL backend for llama.cpp

Enhanced Performance: The new backend significantly boosts the performance of llama.cpp on Adreno GPUs, enabling faster computations and more efficient processing.
Broader Compatibility: Although the backend has been highly optimized for Adreno GPUs, it should run on any GPU that supports the OpenCL 3.0 standard with subgroup support, which broadens compatibility and accessibility.
High Flexibility: Users may modify and optimize the backend for different GPUs, as the current solution uses only standard OpenCL features. For example, the backend can be extended to use vendor extensions targeting other GPUs.
Open-Source Collaboration: This update is a testament to the power of open-source collaboration. We have worked closely with the community so that this backend meets the needs of developers and users alike.
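
As a rough illustration of the subgroup requirement mentioned above, the sketch below checks whether a space-separated OpenCL extension string contains `cl_khr_subgroups`. The `has_extension` helper and the sample string are hypothetical. In practice the string would come from whatever tool you use to query the device, and an extension-name check is only an approximation of full subgroup support.

```shell
# Sketch: test whether an OpenCL extension string lists a given extension.
# has_extension is a hypothetical helper; SAMPLE_EXTS is a made-up sample,
# not the output of a real device.
has_extension() {
  # $1: extension name, $2: space-separated extension string
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

SAMPLE_EXTS="cl_khr_fp16 cl_khr_subgroups cl_khr_3d_image_writes"

if has_extension cl_khr_subgroups "$SAMPLE_EXTS"; then
  echo "subgroups: reported"
else
  echo "subgroups: not reported"
fi
```

Padding both sides with spaces keeps the match to whole extension names, so a longer name that merely contains the search term does not count.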

Tested Models and Platforms

We have rigorously tested llama.cpp with various large language models to confirm its robustness and performance. These tests include:

Meta’s Llama models, including Llama 2 and Llama 3 models with 7 billion (7B) and 8B parameters, among others
Gemma 1 and 2 2B models, and Phi3 mini
Mistral 7B models
Bilingual models such as Qwen 1 and 2 7B, and Baichuan 7B
DeepSeek R1 distilled models

The backend has been tested on many premium devices powered by Snapdragon SoCs:

Laptops running Windows 11 with Snapdragon X Elite and Snapdragon X Plus chips
Android smartphones powered by Snapdragon 8 Gen 1, 2, 3, and the latest Snapdragon 8 Elite

How to Build and Run llama.cpp on Android and Snapdragon X Elite with Windows on Snapdragon

llama.cpp with the Adreno OpenCL backend has been well optimized on Android devices powered by Qualcomm Snapdragon 8 Gen 1, 2, 3, and Elite mobile platforms, as well as on the Snapdragon X Elite Compute Platform running Windows 11. Here are the instructions to build and run llama.cpp on the two platforms.

Steps for Android

List of prerequisite software (other versions may work) and hardware

Ubuntu 22.04
Python3, CMake, Make and Ninja
C/C++ compiler
Android NDK version 26.3.11579264, installed in ~/android-sdk/ndk/26.3.11579264/ (the path used by the commands in these steps)
An Android device powered by Qualcomm Snapdragon 8 Gen 1, 2, 3, or Elite mobile platforms.
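
Before starting, it can help to confirm that the host tools from the list above are on the PATH. The following is a minimal sketch. `check_tools` is a hypothetical helper, and `cc` is used here as a stand-in for "a C/C++ compiler", which is an assumption about how the compiler is named on your system.

```shell
# Sketch: report which prerequisite tools are available on PATH.
# check_tools is a hypothetical helper; it returns non-zero if any tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools python3 cmake make ninja cc || echo "Install the missing tools before continuing."
```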

Install NDK

cd ~
wget https://dl.google.com/android/repository/commandlinetools-linux-8512546_latest.zip && \
unzip commandlinetools-linux-8512546_latest.zip && \
mkdir -p ~/android-sdk/cmdline-tools && \
mv cmdline-tools latest && \
mv latest ~/android-sdk/cmdline-tools/ && \
rm -rf commandlinetools-linux-8512546_latest.zip

yes | ~/android-sdk/cmdline-tools/latest/bin/sdkmanager "ndk;26.3.11579264"
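
A quick sanity check after the install is to confirm that the CMake toolchain file used in the later build steps exists. This is a small sketch. `ndk_toolchain_ok` is a hypothetical helper, and `NDK_ROOT` defaults to the location implied by the commands above.

```shell
# Sketch: verify that the NDK's CMake toolchain file is where the build expects it.
# NDK_ROOT defaults to the path implied by the sdkmanager step; override it if
# your SDK lives elsewhere.
NDK_ROOT="${NDK_ROOT:-$HOME/android-sdk/ndk/26.3.11579264}"

ndk_toolchain_ok() {
  # $1: NDK root directory
  [ -f "$1/build/cmake/android.toolchain.cmake" ]
}

if ndk_toolchain_ok "$NDK_ROOT"; then
  echo "NDK toolchain file found under $NDK_ROOT"
else
  echo "NDK toolchain file missing under $NDK_ROOT; re-run the sdkmanager step"
fi
```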

Install OpenCL headers and ICD loader

The required files for running OpenCL are not directly available in the NDK distribution. Users must download the OpenCL headers and the ICD loader from the official Khronos® OpenCL repos for free. These files are then used along with Android NDK to build the llama.cpp executables.

mkdir -p ~/dev/llm
cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-Headers && \
cd OpenCL-Headers && \
cp -r CL ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include

cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && \
cd OpenCL-ICD-Loader && \
mkdir build_ndk26 && cd build_ndk26 && \
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
  -DOPENCL_ICD_LOADER_HEADERS_DIR=$HOME/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=24 \
  -DANDROID_STL=c++_shared && \
ninja && \
cp libOpenCL.so ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
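
To confirm that both copies landed in the NDK sysroot, a check along these lines may be useful. `check_opencl_sysroot` is a hypothetical helper. The two relative paths follow from the `cp` commands above, and `CL/cl.h` is taken as a representative header from the OpenCL-Headers repository.

```shell
# Sketch: confirm the OpenCL headers and the ICD loader are in the NDK sysroot.
# SYSROOT defaults to the path used by the copy commands above.
SYSROOT="${SYSROOT:-$HOME/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot}"

check_opencl_sysroot() {
  # $1: sysroot directory; prints one line per expected file
  status=0
  for f in usr/include/CL/cl.h usr/lib/aarch64-linux-android/libOpenCL.so; do
    if [ -e "$1/$f" ]; then
      echo "ok:      $f"
    else
      echo "missing: $f"
      status=1
    fi
  done
  return "$status"
}

check_opencl_sysroot "$SYSROOT" || echo "Repeat the copy steps for any missing file."
```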

Build llama.cpp with the Adreno OpenCL backend

cd ~/dev/llm

git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
mkdir build-android && cd build-android

cmake .. -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_OPENCL=ON

ninja

If the build succeeds, the executables will be located in build-android/bin.
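
One possible way to get the binary and a model onto the phone is with adb, run from the build directory. The sketch below only prints the commands by default (`DRY_RUN=1`). The device directory, the model file name, and the `LD_LIBRARY_PATH=/vendor/lib64` setting are assumptions that may need adjusting for your device. They are not taken from the steps above.

```shell
# Sketch: copy llama-cli and a model to an Android device and run it via adb.
# Assumptions: adb is installed and a device is connected; DEVICE_DIR and MODEL
# are illustrative; /vendor/lib64 is assumed to hold the device's libOpenCL.so.
DEVICE_DIR="/data/local/tmp/llama.cpp"
MODEL="ggml-model-qwen1.5-7b-chat-Q4_0.gguf"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run adb shell mkdir -p "$DEVICE_DIR"
run adb push bin/llama-cli "$DEVICE_DIR/"
run adb push "$MODEL" "$DEVICE_DIR/"
run adb shell "cd $DEVICE_DIR && chmod +x llama-cli && LD_LIBRARY_PATH=/vendor/lib64 ./llama-cli -m $MODEL -b 128 -ngl 99 -c 2048 -p 'Hello'"
```

Set `DRY_RUN=0` once the printed commands look right for your setup.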

Steps for Snapdragon X Elite with Windows on Snapdragon

List of prerequisite software (other versions may work) and hardware

Visual Studio 2022 (community or professional version)
Python3, CMake and Ninja
LLVM 19 (can be downloaded from https://github.com/llvm/llvm-project/releases/tag/llvmorg-19.1.0)
A laptop powered by Snapdragon X Elite

Install OpenCL headers and ICD loader

Users must download the OpenCL headers and the ICD loader from the official Khronos® OpenCL repos for free.

These files are then used to build the llama.cpp executables.

mkdir -p ~/dev/llm

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
mkdir build && cd build
cmake .. -G Ninja `
  -DBUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
mkdir build && cd build
cmake .. -G Ninja `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

Build llama.cpp

mkdir -p ~/dev/llm
cd ~/dev/llm

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
mkdir build && cd build

cmake .. -G Ninja `
  -DCMAKE_TOOLCHAIN_FILE="$HOME/dev/llm/llama.cpp/cmake/arm64-windows-llvm.cmake" `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
  -DBUILD_SHARED_LIBS=OFF `
  -DGGML_OPENCL=ON
ninja

If the build succeeds, the executables will be located in build\bin.

Launch the executable

Here is an example of how to run the llama.cpp executable:

./llama-cli -m ggml-model-qwen1.5-7b-chat-Q4_0.gguf -b 128 -ngl 99 -c 2048 -p "Hello"
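
For readers new to llama-cli, the sketch below spells out what each flag in the command above controls by assembling the same command from named variables. The variable names and the `llama_cmd` helper are illustrative only. The helper prints the command rather than running it.

```shell
# Sketch: rebuild the example command from named settings.
MODEL="ggml-model-qwen1.5-7b-chat-Q4_0.gguf"  # -m: path to the GGUF model file
BATCH=128                                     # -b: batch size for prompt processing
GPU_LAYERS=99                                 # -ngl: number of model layers to offload to the GPU
CTX=2048                                      # -c: context size in tokens
PROMPT="Hello"                                # -p: initial prompt text

llama_cmd() {
  # Print the assembled command line (hypothetical helper).
  printf './llama-cli -m %s -b %s -ngl %s -c %s -p "%s"\n' \
    "$MODEL" "$BATCH" "$GPU_LAYERS" "$CTX" "$PROMPT"
}

llama_cmd
```

A `-ngl` value larger than the number of layers in the model, such as 99 here, is a common way to request that every layer be offloaded.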

Note that the Adreno OpenCL backend is currently optimized for weights that use the Q4_0 quantization scheme. Optimization for weights using other schemes, such as FP16 and Q6, is in progress, and we will provide updates soon.
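
If your model is not yet in Q4_0, one option is the llama-quantize tool that ships with llama.cpp. The sketch below prints such a command. It assumes llama-quantize has been built on the host from the same checkout, and both file names are hypothetical placeholders. The `--pure` option asks the tool to quantize all tensors to the requested type rather than mixing types.

```shell
# Sketch: print a llama-quantize command that produces a Q4_0 GGUF file.
# quantize_cmd is a hypothetical helper; the file names are placeholders.
quantize_cmd() {
  # $1: input GGUF (for example an FP16 export), $2: output GGUF
  printf './llama-quantize --pure %s %s Q4_0\n' "$1" "$2"
}

quantize_cmd model-f16.gguf model-Q4_0.gguf
```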

Future Work

The Qualcomm team is working on bringing more Adreno-specific features into the OpenCL backend. Adreno GPUs support a wide range of extensions that allow better performance and power efficiency. For instance, we support features like integer dot product and on-chip global memory (please refer to the Adreno SDK from Qualcomm Developer).

Conclusion

The addition of the OpenCL GPU backend for Adreno GPUs is a significant step forward for llama.cpp. We are excited to see how this enhancement will be utilized by the community and look forward to your feedback.

Want to know more? Join our Discord community to engage with Qualcomm Technologies’ experts, connect with fellow developers working with our technology and stay updated on the latest developer-focused news and product updates.

Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

About the Author

Hongqiang Wang, Principal Engineer/Manager