Developer Blog

Accelerating LLAMA inference on mobile CPUs using Qualcomm Matrix Extensions

Written by

VenuGopal Reddy Gundluru

Apr 7, 2026

Large Language Models (LLMs) have rapidly moved from research prototypes to everyday user-facing features. Whether the task is summarization, conversational assistance, or offline reasoning, the challenge is no longer model quality alone; it is how to run these models efficiently under tight latency, power, and memory constraints. On mobile and edge devices, CPUs are a universally available and reliable execution target.

In this post, you’ll learn how to enable and benchmark Qualcomm Matrix Extension (MX) acceleration for Llama models on Snapdragon platforms, using a reproducible, step‑by‑step setup that you can try in under 30 minutes.

If you look closely, the core of these models is compute‑intensive linear algebra, dominated by matrix multiplications. Each LLM inference request proceeds through two primary phases:

Prefill phase, during which the model processes the input prompt (e.g., text tokens or image embeddings).
Decode phase, where the model autoregressively generates output tokens one at a time conditioned on the prompt.

From a systems and performance perspective, these phases exhibit distinct characteristics:

The prefill phase is predominantly compute bound, driven by large General Matrix–Matrix Multiplication (GEMM) operations.
The decode phase is typically memory bound, with performance dominated by General Matrix–Vector Multiplication (GEMV) and associated memory accesses.

Together, GEMM and GEMV operations form the computational backbone of transformer based inference. In the following section, we focus on a specific family of LLMs where these execution patterns are particularly pronounced and performance critical.
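
To make the compute-bound versus memory-bound distinction concrete, here is a minimal sketch that estimates arithmetic intensity (operations per byte moved) for one linear layer in each phase. The layer shape (2048 x 2048), the 128-token prompt, and the byte sizes (1-byte weights and activations, 4-byte outputs) are illustrative assumptions, not values taken from a specific model, and the estimate assumes each operand is read once.

```python
def matmul_arithmetic_intensity(m, k, n, act_bytes=1.0, weight_bytes=1.0, out_bytes=4.0):
    """Operations per byte for an (m x k) @ (k x n) product.

    Assumes activations and weights are each read once and the output is
    written once, which ignores caching effects and packing overheads.
    """
    ops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = m * k * act_bytes + k * n * weight_bytes + m * n * out_bytes
    return ops / bytes_moved


# Hypothetical layer shape, chosen only for illustration.
K = N = 2048

prefill_intensity = matmul_arithmetic_intensity(128, K, N)  # GEMM: 128 prompt tokens
decode_intensity = matmul_arithmetic_intensity(1, K, N)     # GEMV: one token per step

print(f"prefill (GEMM): {prefill_intensity:.1f} ops/byte")
print(f"decode  (GEMV): {decode_intensity:.1f} ops/byte")
```

Under these assumptions the prefill GEMM reuses each weight across all 128 tokens, while the decode GEMV touches each weight once per generated token, so its intensity is close to 2 operations per byte.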

Llama Models and Rationale for Focus

We focus on the Llama family of models, which span configurations from approximately 1B to 8B parameters and are well suited for edge and on‑device deployment. These models closely reflect the architecture of modern transformer‑based LLMs while maintaining a scale that is practical for resource‑constrained environments.

Llama models are available in multiple quantized representations (e.g., INT8 and INT4), enabling significant reductions in memory footprint and computational cost without materially compromising accuracy. This makes them particularly effective for deployment on devices such as mobile phones and edge systems, where power, memory bandwidth, and compute capacity are limited.
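
As a rough illustration of the memory savings, the sketch below estimates the weight footprint for nominal parameter counts at different bit widths. It deliberately ignores quantization scales, metadata, the KV cache, and activations, and the parameter counts are round nominal figures rather than exact model sizes.

```python
def weight_footprint_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring scales and metadata."""
    return n_params * bits_per_weight / 8 / 2**30


# Nominal parameter counts (assumptions for illustration).
for name, n_params in [("1B", 1e9), ("3B", 3e9), ("8B", 8e9)]:
    fp16 = weight_footprint_gib(n_params, 16)
    int8 = weight_footprint_gib(n_params, 8)
    int4 = weight_footprint_gib(n_params, 4)
    print(f"{name}: FP16 {fp16:.2f} GiB, INT8 {int8:.2f} GiB, INT4 {int4:.2f} GiB")
```

Real quantized files are somewhat larger than this estimate because block-wise formats store per-block scale factors alongside the weights.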

In addition to their architectural relevance, Llama models benefit from strong open‑source ecosystem support, broad tooling compatibility, and widespread adoption across the industry. This combination of flexibility, accessibility, and representativeness makes Llama an ideal benchmark for evaluating performance gains from CPU‑level acceleration mechanisms, such as the Qualcomm Matrix Extension (QMX).

Why Run LLMs on CPUs?

While GPUs are commonly used for both training and inference, CPUs remain the default platform for real‑world LLM deployment, particularly for inference on mobile, edge, and embedded devices. Several factors make CPUs a practical and scalable choice in these environments:

Ubiquitous availability and seamless integration, supported by a mature and extensive software ecosystem
Native support for quantized data types and flexible precision, enabling efficient execution of modern LLMs
Simplified deployment and operational pipelines, with fewer runtime dependencies compared to GPU‑centric stacks

These characteristics make CPUs especially well suited for cost‑effective, reliable, and widely deployable on‑device inference.

Qualcomm Matrix Extension (MX): Purpose‑Built for LLM Acceleration

To address the matrix‑intensive nature of LLM workloads, Qualcomm Technologies introduced Qualcomm MX, a hardware matrix‑multiplication acceleration capability integrated directly into its latest CPU architectures.

Qualcomm MX is compatible with the ARM Scalable Matrix Extension (SME) instruction set and is specifically designed to accelerate transformer inference.

Key capabilities of Qualcomm MX include:

Specialized matrix instructions and tile‑based execution, optimized for high‑throughput GEMM and GEMV operations
Support for low‑precision data types, such as INT8 and BF16, which are well aligned with quantized LLM inference
Efficient compute and memory utilization, helping reduce bottlenecks in attention mechanisms and feed‑forward layers

By enabling high‑performance matrix operations directly on the CPU, Qualcomm MX significantly improves LLM inference efficiency on platforms where GPU resources may be constrained or unavailable.
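
The idea of tile-based execution can be sketched in plain Python. Arm SME exposes outer-product instructions that accumulate into a two-dimensional tile register; the code below mimics only that general pattern (a small accumulator tile updated by rank-1 outer products along the reduction dimension). It is a conceptual sketch, not a description of how Qualcomm MX hardware or the KleidiAI kernels are implemented, and the function names and the tile size of 4 are assumptions.

```python
def tiled_matmul(a, b, tile=4):
    """Multiply a (m x k) by b (k x n) one output tile at a time.

    Each output tile is held in a small accumulator and updated with one
    outer product per step along the reduction dimension k.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        rows = range(i0, min(i0 + tile, m))
        for j0 in range(0, n, tile):
            cols = range(j0, min(j0 + tile, n))
            acc = [[0] * len(cols) for _ in rows]  # accumulator tile
            for p in range(k):
                # Rank-1 update: column p of the A slice times row p of the B slice.
                for ti, i in enumerate(rows):
                    a_ip = a[i][p]
                    for tj, j in enumerate(cols):
                        acc[ti][tj] += a_ip * b[p][j]
            # Write the finished tile back to the output matrix.
            for ti, i in enumerate(rows):
                for tj, j in enumerate(cols):
                    c[i][j] = acc[ti][tj]
    return c


def naive_matmul(a, b):
    """Reference implementation used to check the tiled version."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(a, b))
```

Keeping the accumulator tile resident while streaming slices of A and B is what lets each loaded value be reused many times, which is why the approach favors large GEMMs over single-row GEMVs.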

Experimental setup and metrics:

We evaluated Qualcomm MX performance using llama.cpp, a widely adopted open‑source framework optimized for LLM inference.

Test Configuration

Hardware (QMX): Snapdragon 8 Elite Gen 5 Mobile platform with Qualcomm MX enabled, hereafter referred to as Snapdragon 8 Elite Gen 5 Mobile (Qualcomm MX)
Baseline (NEON): Snapdragon 8 Elite Gen 5 Mobile platform using Neon SIMD instructions only, hereafter referred to as Snapdragon 8 Elite Gen 5 Mobile (NEON)
Threading Configuration:
- 1 thread and 4 threads
- Applied consistently across both prefill and decode phases on both platforms
Models Evaluated:
- Gemma 270M
- Gemma 1B
- Llama 1B
- Gemma 4B
- Llama 3B
- Llama 8B
Model Format: all models were evaluated in quantized GGUF format
Prompt Length (prefill phase): 128 tokens
Maximum Decode Length: 128 tokens

Metrics

Time‑to‑First‑Token (TTFT)
Measures end‑to‑end latency during the prefill phase (in seconds). TTFT reflects system responsiveness, i.e., the time a user waits before the first output token is generated.
Token Generation Rate
Measures decode‑phase throughput in tokens per second. This metric captures generation smoothness once decoding begins; for reference, average human reading speed is approximately 4–7 tokens per second.
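
The two metrics can be derived from three timestamps per run. The sketch below uses one common convention: the first token is attributed to prefill, and the remaining tokens are divided by the time elapsed after the first token. The RunTiming structure and the timestamps are hypothetical and only illustrate the arithmetic; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    t_request: float      # when the prompt was submitted (seconds)
    t_first_token: float  # when the first output token appeared
    t_last_token: float   # when the last output token appeared
    n_generated: int      # total number of generated tokens


def ttft_seconds(run):
    """Time-to-First-Token: latency of the prefill phase."""
    return run.t_first_token - run.t_request


def decode_tokens_per_second(run):
    """Decode throughput over the tokens produced after the first one."""
    decode_time = run.t_last_token - run.t_first_token
    if decode_time <= 0 or run.n_generated < 2:
        raise ValueError("need at least two tokens and a positive decode time")
    return (run.n_generated - 1) / decode_time


# Hypothetical timestamps, not measured data.
example = RunTiming(t_request=0.0, t_first_token=0.5, t_last_token=4.5, n_generated=41)
print(f"TTFT: {ttft_seconds(example):.2f} s")
print(f"Decode: {decode_tokens_per_second(example):.1f} tokens/s")
```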

This evaluation methodology enables a controlled comparison between Qualcomm MX accelerated execution and a NEON‑only baseline, isolating the performance impact of Qualcomm MX under realistic edge‑inference conditions across a range of model sizes and threading configurations.

Experimental context

All results are reported as the average over multiple runs. CPU frequencies were fixed (Small core: 3953 MHz, Big core: 4394 MHz) to ensure repeatability. Measurements compare Neon SIMD against Qualcomm MX acceleration on the same Snapdragon 8 Elite Gen 5 Mobile platform.

Quantized 8-bit models: performance impact of Qualcomm MX

Summary

Key observations for Q8 models:

TTFT improves by up to 2.9x (single‑thread) and ~2.0x (4 threads) with Qualcomm MX.
Decode throughput improves modestly (up to ~1.5x) in single‑thread mode.
Multi‑thread decode shows limited scaling, indicating memory and packing constraints.

Configuration	TTFT Speedup (Max)	Decode Speedup (Max)
Quantized 8-bit, 1 Thread	2.9x	1.5x
Quantized 8-bit, 4 Threads	2.0x	1.05x

TTFT Speedup – The chart below shows that the Qualcomm MX benefit grows with model size and is strongest in the compute‑bound prefill phase.

Interpretation: The prefill phase is dominated by large, compute‑intensive GEMM operations, including Q/K/V projections and MLP layers. Because Qualcomm MX directly accelerates these well‑tiled matrix multiplications, it delivers substantial reductions in Time‑to‑First‑Token (TTFT), resulting in the largest observed performance gains.

Decode Throughput (TPS)

Interpretation: The decode phase is dominated by smaller GEMM and GEMV kernels and increasingly constrained by memory access and data packing overheads. While Qualcomm MX continues to provide acceleration benefits, the achievable speedup is naturally limited relative to the prefill phase due to these memory‑bound characteristics.

Quantized 4-bit models: performance impact of Qualcomm MX

Summary

TTFT improves by up to 1.52x (single thread) and up to ~1.34x (4 threads) with Qualcomm MX, reflecting acceleration of compute‑bound prefill kernels.
Decode throughput remains effectively unchanged across both threading configurations, indicating memory bandwidth saturation.
Thread‑level parallelism does not materially improve decode performance, as decode remains dominated by memory access and tensor packing overheads rather than computation.

Configuration	TTFT Speedup (Max)	Decode Speedup (Max)
Quantized 4-bit, 1 Thread	2.07x	~1.25x
Quantized 4-bit, 4 Threads	1.77x	~1.21x

Compared to Quantized 8-bit, the reduced arithmetic intensity of Quantized 4-bit models shifts inference bottlenecks toward memory bandwidth, limiting the incremental benefits of hardware compute acceleration during decode.

TTFT Speedup

Decode Throughput (TPS)

Across all experiments, three consistent trends emerge:

Qualcomm MX delivers the largest gains in compute‑bound regimes, particularly during prefill and for higher‑precision models.
Decode performance scales weakly with compute acceleration, as it transitions quickly into a memory‑bound regime.
Thread‑level parallelism reduces per‑core compute pressure, diminishing the relative advantage of Qualcomm MX compared to single‑thread execution.
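
These trends are consistent with a simple roofline-style model, in which attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity. The sketch below is only an illustration: the peak, bandwidth, and intensity figures are made-up round numbers, not specifications or measurements of any Snapdragon platform.

```python
def attainable_gops(peak_gops, bandwidth_gb_s, ops_per_byte):
    """Roofline estimate: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_gops, bandwidth_gb_s * ops_per_byte)


# Hypothetical machine parameters, for illustration only.
BANDWIDTH_GB_S = 50.0
BASELINE_PEAK_GOPS = 1000.0
ACCELERATED_PEAK_GOPS = 3000.0  # pretend matrix acceleration triples peak compute

results = {}
for label, intensity in [("prefill-like (GEMM)", 200.0), ("decode-like (GEMV)", 2.0)]:
    base = attainable_gops(BASELINE_PEAK_GOPS, BANDWIDTH_GB_S, intensity)
    fast = attainable_gops(ACCELERATED_PEAK_GOPS, BANDWIDTH_GB_S, intensity)
    results[label] = fast / base
    print(f"{label}: {base:.0f} -> {fast:.0f} GOPS ({fast / base:.1f}x)")
```

In this toy model, tripling peak compute triples throughput for the high-intensity workload but leaves the low-intensity workload unchanged, because the latter is capped by bandwidth in both cases.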

Conclusion

This article evaluated the impact of the Qualcomm Matrix Extension (MX) on Llama-based large language model inference on mobile CPUs, using a realistic edge-deployment setup and an open-source inference stack.

Our results demonstrate that Qualcomm MX provides substantial latency reductions during the prefill phase, with Time-to-First-Token (TTFT) improvements of up to 2.9x for Q8 models and 1.5x for Q4 models, directly accelerating the compute-bound GEMM kernels that dominate attention and feed-forward layers.

In contrast, decode-phase performance shows limited sensitivity to compute acceleration. As arithmetic intensity decreases, inference increasingly becomes memory-bandwidth bound, with performance constrained by weight streaming, tensor packing, and data movement rather than raw compute throughput.

This behavior persists across threading configurations, highlighting that decode performance is fundamentally governed by memory system characteristics rather than matrix compute capability.

Taken together, these results illustrate that hardware matrix acceleration on CPUs is highly effective for reducing end-to-end latency in LLM inference, especially in the prefill phase and for higher-precision quantization regimes.

At the same time, they underscore the importance of holistic system optimization, where compute acceleration must be complemented by improvements in memory bandwidth utilization, data layout, and runtime scheduling to unlock further gains during decoding.
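
A back-of-the-envelope latency model shows why decode-side improvements matter for long generations. The sketch below assumes total latency is TTFT plus the time to decode the remaining tokens; the baseline timings are hypothetical, and only the 2.9x TTFT speedup figure comes from the Q8 single-thread results above.

```python
def end_to_end_seconds(ttft_s, decode_tps, n_tokens):
    """Total latency: prefill time plus decode time for the tokens after the first."""
    return ttft_s + (n_tokens - 1) / decode_tps


def overall_speedup(ttft_s, decode_tps, n_tokens, ttft_speedup, decode_speedup):
    """Ratio of baseline to accelerated end-to-end latency."""
    baseline = end_to_end_seconds(ttft_s, decode_tps, n_tokens)
    accelerated = end_to_end_seconds(ttft_s / ttft_speedup,
                                     decode_tps * decode_speedup,
                                     n_tokens)
    return baseline / accelerated


# Hypothetical baseline: 2.0 s TTFT and 10 tokens/s decode.
for n_tokens in (1, 16, 128):
    s = overall_speedup(2.0, 10.0, n_tokens, ttft_speedup=2.9, decode_speedup=1.0)
    print(f"{n_tokens:>3} generated tokens: {s:.2f}x end-to-end")
```

With decode speed unchanged, the end-to-end gain shrinks as the number of generated tokens grows, which is the motivation for complementing compute acceleration with memory-side optimizations.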

Qualcomm MX enables high-performance, low-latency LLM inference directly on mobile CPUs, reducing reliance on discrete accelerators and enabling more scalable, power-efficient, and deployable on-device AI experiences.

Future Directions

Several opportunities emerge from this work:

Memory-centric optimizations for decode, including improved weight packing, prefetching, and cache-aware layouts, to better complement Qualcomm MX acceleration.
Hybrid quantization strategies, where higher precision is selectively retained in compute-critical layers to maximize the benefit of matrix acceleration.
Runtime-level scheduling and fusion, to better overlap compute and memory operations during decode.
Broader model coverage, including instruction-tuned and multimodal variants, to further characterize Qualcomm MX benefits across emerging LLM workloads.

As LLM inference continues to move toward the edge, CPU-centric acceleration mechanisms such as Qualcomm MX will play an increasingly important role in delivering responsive, private, and energy-efficient AI experiences at scale.

These results highlight that matrix acceleration alone is not sufficient to maximize end-to-end LLM inference performance; memory bandwidth and data layout play a critical role, especially for quantized decoding.

Try QMX acceleration on Snapdragon in under 30 minutes

Step‑by‑step setup and benchmarking on Snapdragon platforms with QMX support

1. Source Setup (llama.cpp and KleidiAI)

# 1. Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# 2. Checkout the specific commit that is verified to work

git checkout 25f40ca65f1aa596f8b1702bbac4bc48a45b87d7 

# 3. Clone the KleidiAI repository

git clone https://github.com/ARM-software/kleidiai.git  

# 4. Update the hardcoded KleidiAI source path

# Open the following file in your preferred text editor:

# ggml/src/ggml-cpu/CMakeLists.txt

# Locate the KLEIDIAI_SRC variable (around line 523). It is set to a hardcoded path.

# Change it to the actual absolute path where your 'kleidiai' folder is located. For example:

# set(KLEIDIAI_SRC "<Path to Kleidiai repo>/kleidiai")
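
If you prefer to script this edit, the following is a minimal sketch of a helper that rewrites the first set(KLEIDIAI_SRC ...) statement in the CMake text. The function name set_kleidiai_src, the sample CMake snippet, and both paths are hypothetical; the sketch assumes the statement fits on one line and contains no closing parenthesis inside the path.

```python
import re


def set_kleidiai_src(cmake_text, kleidiai_path):
    """Return cmake_text with the first set(KLEIDIAI_SRC ...) pointing at kleidiai_path."""
    pattern = re.compile(r"set\(\s*KLEIDIAI_SRC\s+[^)]*\)")
    replacement = f'set(KLEIDIAI_SRC "{kleidiai_path}")'
    # A function replacement avoids backslash-escape processing on the path.
    new_text, count = pattern.subn(lambda _match: replacement, cmake_text, count=1)
    if count == 0:
        raise ValueError("no set(KLEIDIAI_SRC ...) statement found")
    return new_text


# Hypothetical snippet standing in for ggml/src/ggml-cpu/CMakeLists.txt.
sample = 'if (GGML_CPU_KLEIDIAI)\n    set(KLEIDIAI_SRC "/old/hardcoded/kleidiai")\nendif()\n'
patched = set_kleidiai_src(sample, "/home/user/src/kleidiai")
print(patched)
```

To apply it to the real file, read ggml/src/ggml-cpu/CMakeLists.txt, pass its contents through the helper, and write the result back, after checking the diff by hand.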

2. Models (we worked with Quantized 4-bit and Quantized 8-bit models)

Download the appropriate models from: Gemma 270M, Gemma 1B, Llama 1B, Gemma 4B, Llama 3B, and Llama 8B in quantized GGUF format

3. Build Instructions for Android devices

We use a specific set of flags to enable KleidiAI and ARMv9-compatible extensions.

Environment Setup:

export NDK_PATH="<path to NDK tools>/android-ndk-r28b"
mkdir build && cd build

CMake Configuration for Snapdragon 8 Elite Gen 5:

cmake -DCMAKE_BUILD_TYPE=Release \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-29 \
  -DCMAKE_TOOLCHAIN_FILE="$NDK_PATH/build/cmake/android.toolchain.cmake" \
  -DLLAMA_CURL=OFF \
  -DGGML_CPU_KLEIDIAI=ON \
  -DGGML_SYSTEM_ARCH=ARM \
  -DGGML_CPU_AARCH64=ON \
  -DGGML_CPU_ARM_ARCH="armv9.2-a+sve2+sme+dotprod+i8mm" \
  -DCMAKE_C_FLAGS="-march=armv9.2-a+sve2+sme+dotprod+i8mm -fno-omit-frame-pointer -g" \
  -DCMAKE_CXX_FLAGS="-march=armv9.2-a+sve2+sme+dotprod+i8mm -fno-omit-frame-pointer -g" \
  -DCMAKE_C_COMPILER_TARGET=aarch64-linux-android29 \
  -DCMAKE_CXX_COMPILER_TARGET=aarch64-linux-android29 \
  -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
  -DGGML_LLAMAFILE=OFF \
  ..

Compile:

make -j16

4. Device Setup (ADB)

You must push the shared libraries and executables to the device (you can find these libraries in <path_where_you_created_this_repository>/build/bin). Ensure the destination folder exists.

# 1. Create directory on device
adb -s <device_id> shell "mkdir -p /data/local/tmp/<directory to push the libs>" 

# 2. Push Libraries (Required for runtime)
adb -s <device_id> push ./libllama.so <directory on the target /data/local/tmp/>

adb -s <device_id> push ./libggml-base.so <directory on the target /data/local/tmp/>

adb -s <device_id> push ./libggml.so <directory on the target /data/local/tmp/>

adb -s <device_id> push ./libggml-cpu.so <directory on the target /data/local/tmp/>

adb -s <device_id> push ./libmtmd.so <directory on the target /data/local/tmp/>

adb -s <device_id> push ./libomp.so <directory on the target /data/local/tmp/> 

# 3. Push Executables

adb -s <device_id> push ./llama-batched-bench <directory on the target /data/local/tmp/>

adb -s <device_id> push ./llama-cli <directory on the target /data/local/tmp/> 

# 4. Push Model (Example)

adb -s <device_id> shell "mkdir -p /data/local/tmp/models"

adb -s <device_id> push ./gemma-3-270m-it-Q4_0.gguf /data/local/tmp/models/
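
The push steps above are repetitive, so they can be generated rather than typed. The sketch below only builds the adb argument lists for the libraries and executables named in this section; it does not run them. The function name adb_push_commands and the example device ID and directories are hypothetical placeholders.

```python
LIBRARIES = ["libllama.so", "libggml-base.so", "libggml.so",
             "libggml-cpu.so", "libmtmd.so", "libomp.so"]
EXECUTABLES = ["llama-batched-bench", "llama-cli"]


def adb_push_commands(device_id, build_bin_dir, target_dir):
    """Build the mkdir command plus one 'adb push' command per file."""
    commands = [["adb", "-s", device_id, "shell", "mkdir", "-p", target_dir]]
    for name in LIBRARIES + EXECUTABLES:
        commands.append(["adb", "-s", device_id, "push",
                         f"{build_bin_dir}/{name}", target_dir + "/"])
    return commands


# Placeholder values; substitute your own device ID and paths.
push_plan = adb_push_commands("DEVICE123", "./build/bin", "/data/local/tmp/llama")
for command in push_plan:
    print(" ".join(command))
```

Each list can be passed to subprocess.run(command, check=True) once the printed plan looks right for your device.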

5. Benchmarking (SME vs. Neon)

adb -s <device_id> shell

cd <directory on the target /data/local/tmp/>

Run with SME or NEON

Set GGML_KLEIDIAI_SME=1 to enable SME optimization, or GGML_KLEIDIAI_SME=0 to run on NEON, in the command below:

LD_LIBRARY_PATH=. GGML_KLEIDIAI_SME=1 ./llama-batched-bench -m /data/local/tmp/models/gemma-3-270m-it-Q4_0.gguf -c 2048 -b 2048 -ub 512 -npp 512 -ntg 200 -npl 1 -t 8  -fa on
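
To reproduce the comparison described in the test configuration (SME versus NEON, 1 and 4 threads, 128 prompt tokens, 128 generated tokens), the command above can be generated for each combination. This sketch reuses only the flags that appear in that command; note that it substitutes the 128/128 lengths and the 1 and 4 thread counts from the test configuration for the 512/200/8 values shown above. The function name bench_command is hypothetical.

```python
import shlex

MODEL_PATH = "/data/local/tmp/models/gemma-3-270m-it-Q4_0.gguf"


def bench_command(sme_enabled, threads, n_prompt=128, n_gen=128):
    """Build one on-device benchmark command line as a string."""
    env = f"LD_LIBRARY_PATH=. GGML_KLEIDIAI_SME={1 if sme_enabled else 0}"
    args = ["./llama-batched-bench", "-m", MODEL_PATH,
            "-c", "2048", "-b", "2048", "-ub", "512",
            "-npp", str(n_prompt), "-ntg", str(n_gen),
            "-npl", "1", "-t", str(threads), "-fa", "on"]
    return env + " " + " ".join(shlex.quote(arg) for arg in args)


# One command per (threads, backend) combination: NEON first, then SME.
commands = [bench_command(sme, threads) for threads in (1, 4) for sme in (False, True)]
for command in commands:
    print(command)
```

Run each printed line inside the adb shell from the directory that holds the pushed libraries, and repeat each configuration several times before averaging.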

Try it on Snapdragon

Ready to get hands‑on? Use the steps above to try QMX acceleration on Snapdragon in under 30 minutes by building llama.cpp with SME enabled and running the included benchmarks.

Measure TTFT and decode performance on your own device, then tune model size, quantization, and threading to see how QMX impacts your specific LLM workload.
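
To compare runs, you can parse the benchmark output and compute speedups. The sketch below assumes the tool prints a pipe-delimited results table with a header row; the column names used here ("T_PP s" for prefill time, "S_TG t/s" for decode throughput) are assumptions to verify against the output of your build, and the numbers in the two sample outputs are placeholders, not measurements.

```python
def parse_pipe_table(text):
    """Parse rows of a pipe-delimited table into dicts keyed by header name."""
    header, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip log lines around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # skip the |-----|-----| separator row
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Placeholder outputs for a NEON run and an SME run (not real measurements).
neon_output = """
some log line
|    PP |     TG |   T_PP s | S_TG t/s |
|-------|--------|----------|----------|
|   128 |    128 |    1.000 |    20.00 |
"""
sme_output = neon_output.replace("1.000", "0.500").replace("20.00", "22.00")

neon = parse_pipe_table(neon_output)[0]
sme = parse_pipe_table(sme_output)[0]

ttft_speedup = float(neon["T_PP s"]) / float(sme["T_PP s"])        # lower time is better
decode_speedup = float(sme["S_TG t/s"]) / float(neon["S_TG t/s"])  # higher rate is better
print(f"TTFT speedup: {ttft_speedup:.2f}x, decode speedup: {decode_speedup:.2f}x")
```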

Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

About the Author

VenuGopal Reddy Gundluru
Engineer, Principal/Manager at Qualcomm Technologies, Inc.