Modern machine learning frameworks increasingly rely on highly optimized backend libraries to get the most out of the underlying hardware. One of the most important of these libraries is XNNPACK, a high-performance operator library used across TensorFlow, PyTorch/ExecuTorch, ONNX Runtime, and other ML frameworks.
In this post, we walk through what XNNPACK is, how it plugs into ML frameworks, why Hexagon is such a natural fit for XNNPACK’s microkernel approach – and finally, how our newly added HVX kernels perform on workloads like MobileNet, Attention, and Transformer blocks.
Qualcomm AI at the edge
Qualcomm Technologies, Inc. is driving advancements to unlock performant, efficient, and fast edge AI across devices powered by Snapdragon platforms. Our Qualcomm AI Stack – including Qualcomm AI Engine Direct and the Qualcomm Neural Processing SDK – provides the software, runtime, and tools needed to run AI efficiently across NPU, GPU and CPU on Snapdragon.
These solutions deliver the highest performance-per-watt and are the primary path for developers targeting production-grade AI acceleration on devices powered by Snapdragon or Qualcomm processors.
For developers who currently rely on XNNPACK for edge inference on the CPU, we are introducing Hexagon NPU + XNNPACK as an additional, open-source-friendly way to accelerate certain ML workloads.
Hexagon NPU + XNNPACK refers to the integration of architecture-optimized microkernels for the Hexagon processor directly into XNNPACK, enabling certain XNNPACK operators to offload from CPU to the NPU through the standard XNNPACK subgraph API – without requiring changes to upstream ML frameworks.
The integration expands the options available to developers who want to stay entirely within an open-source inference runtime. At the same time, it is important to note that Hexagon NPU + XNNPACK is not intended to match the performance-per-watt, breadth of operator coverage, or full-graph optimization capabilities of Qualcomm AI Engine Direct or the Qualcomm Neural Processing SDK.
Instead, it offers a lightweight and accessible pathway for accelerating a subset of operations through familiar open-source tooling.
What is XNNPACK?
If you have used TensorFlow Lite, ONNX Runtime, or ExecuTorch, you have probably already used XNNPACK, even if you never called it directly.
XNNPACK is a collection of highly optimized math routines for ML inference, originally from Google. You can think of XNNPACK as a low-level performance engine that runs inside ML frameworks (TensorFlow Lite, PyTorch/ExecuTorch, ONNX Runtime, etc.).
XNNPACK focuses on inference, not training, and is built around architecture-specific microkernels (ARM NEON-compatible architectures, x86 AVX, WebAssembly, RISC-V, and now Hexagon).
What XNNPACK does (and doesn’t) do
When people first hear about XNNPACK, it’s easy to confuse it with a full graph compiler. It isn’t: XNNPACK lives below the graph level. It does not:
- Perform cross-op fusion (e.g., Conv+BN+ReLU, long elementwise chains).
- Do graph rewrites like constant folding or inserting quantize/dequantize nodes.
- Decide op ordering, device placement, or global scheduling.
- Choose quantization strategy or layouts on its own.
All of that is the job of the ML framework (TFLite, PyTorch, ONNX Runtime, etc.). The framework partitions the graph and then hands chunks of it to XNNPACK as “subgraphs” to execute.
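The hand-off described above can be pictured with a toy partitioner. This is only a sketch under simplifying assumptions: the graph is a linear list of operator names rather than a DAG, and the `SUPPORTED` set and the `partition` helper are hypothetical names for illustration, not part of any framework or of XNNPACK.

```python
from typing import List, Tuple

# Hypothetical set of operators the backend library can execute.
# This is NOT XNNPACK's real operator list; it only drives the example.
SUPPORTED = {"conv2d", "fully_connected", "add", "softmax"}


def partition(ops: List[str]) -> List[Tuple[str, List[str]]]:
    """Group a linear op sequence into maximal runs for one target.

    Supported ops are grouped into "delegate" chunks (the subgraphs handed
    to the backend); everything else stays with the framework.
    """
    parts: List[Tuple[str, List[str]]] = []
    for op in ops:
        target = "delegate" if op in SUPPORTED else "framework"
        if parts and parts[-1][0] == target:
            parts[-1][1].append(op)   # extend the current run
        else:
            parts.append((target, [op]))  # start a new run
    return parts


model = ["conv2d", "add", "custom_nms", "fully_connected", "softmax"]
partitions = partition(model)
for target, chunk in partitions:
    print(target, chunk)
```

In this toy model the unsupported `custom_nms` op splits the graph into two delegated subgraphs with a framework-executed op between them, which is the general shape of the division of labor described here.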
How XNNPACK works
XNNPACK focuses on turning those subgraphs into fast kernel calls by providing a subgraph API to build a DAG of operators and lowering each subgraph into calls to optimized microkernels for the chosen architecture. It also provides packing and layout transformations for weights and activations.
Along the way, it performs operator-specific optimizations, such as weight packing (reordering and interleaving for GEMM/conv), selecting the best kernel variant (tile sizes, MR/NR, data type), and using fused operators when available (e.g., SoftmaxFused instead of a chain of ReduceMax → Sub → Exp → ReduceSum → Div).
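To make the idea of weight packing concrete, here is a minimal NumPy sketch. It assumes a simple layout in which the K×N weight matrix is split into panels of `nr` columns, zero-padded at the tail, so that each panel is contiguous in memory. The function names and the layout are illustrative assumptions; XNNPACK's real packed formats differ per kernel and may also interleave bias and other data.

```python
import numpy as np


def pack_weights(b: np.ndarray, nr: int) -> np.ndarray:
    """Pack a (K, N) weight matrix into (ceil(N/nr), K, nr) contiguous panels."""
    k, n = b.shape
    n_panels = -(-n // nr)  # ceiling division
    padded = np.zeros((k, n_panels * nr), dtype=b.dtype)
    padded[:, :n] = b  # zero-pad the tail columns
    # (K, panels, nr) -> (panels, K, nr): one contiguous block per panel
    return np.ascontiguousarray(padded.reshape(k, n_panels, nr).transpose(1, 0, 2))


def gemm_packed(a: np.ndarray, packed: np.ndarray, n: int) -> np.ndarray:
    """Multiply A (M, K) by packed weights, one nr-wide panel at a time."""
    m, _ = a.shape
    n_panels, _, nr = packed.shape
    out = np.empty((m, n_panels * nr), dtype=a.dtype)
    for p in range(n_panels):
        out[:, p * nr:(p + 1) * nr] = a @ packed[p]
    return out[:, :n]  # drop the padded columns


rng = np.random.default_rng(0)
a = rng.standard_normal((3, 5)).astype(np.float32)
b = rng.standard_normal((5, 10)).astype(np.float32)
packed = pack_weights(b, nr=4)
result = gemm_packed(a, packed, n=10)
```

Packing is done once per weight tensor, so its cost is amortized across every inference that reuses the packed panels.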
Hexagon NPU
The Hexagon NPU is Qualcomm Technologies’ dedicated on-device AI inference processor, designed to deliver high performance, low power, and efficient execution of generative AI and AI/ML workloads across devices powered by Qualcomm technology.
The Hexagon NPU combines:
- A scalar VLIW (Very Long Instruction Word) core for control-heavy code
- A wide SIMD engine called Hexagon Vector eXtensions (HVX).
HVX adds very wide 128-byte vectors with their own register file and instruction set for packed integer and floating-point math. Multiple vector instructions can be issued in parallel from the scalar core, so both scalar control code and vectorized code can be kept busy at the same time.
Hexagon pairs HVX with a fast on-chip scratchpad (VTCM) and a cache hierarchy to keep those wide vector units fed. For best performance, data should be laid out and aligned to the native vector width of 128 bytes. Common ML data types (Int8/Int16, FP16/FP32) are natively supported in HVX instructions.
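As a quick illustration of what the 128-byte vector width means for different data types, the sketch below computes how many elements fit in one vector and how long a buffer becomes once rounded up to whole vectors. The helper names are hypothetical; only the 128-byte width comes from the text.

```python
import numpy as np

HVX_VECTOR_BYTES = 128  # native HVX vector width, as stated above


def lanes_per_vector(dtype) -> int:
    """Number of elements of `dtype` that fit in one 128-byte vector."""
    return HVX_VECTOR_BYTES // np.dtype(dtype).itemsize


def padded_length(n_elements: int, dtype) -> int:
    """Round an element count up to a whole number of vectors."""
    lanes = lanes_per_vector(dtype)
    return -(-n_elements // lanes) * lanes  # ceiling to a multiple of lanes


for dt in (np.int8, np.int16, np.float16, np.float32):
    print(np.dtype(dt).name, lanes_per_vector(dt), padded_length(100, dt))
```

A row of 100 FP32 values, for example, occupies four 32-lane vectors once padded, so a quarter of the last vector carries no useful data.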
Why Hexagon NPU + XNNPACK is interesting
For ML workloads, Hexagon NPU hits a sweet spot: the scalar core handles code that does not map cleanly to SIMD, such as loop control and boundary conditions, while vector-friendly kernels such as GEMM, convolutions, layer norms, SoftMax and key elementwise operations are offloaded to the HVX units for high vector throughput.
Once Hexagon/HVX microkernels land inside XNNPACK, they become available to a broad set of frameworks with no custom patches. TensorFlow Lite uses XNNPACK through its delegates, ONNX Runtime through its execution provider, and ExecuTorch as a backend.
The same microkernels can therefore serve a wide range of inference runtimes.
This also means improvements can scale. As new tiles, packing algorithms, fused kernels and prefetch optimizations are added, every framework that relies on XNNPACK benefits automatically.
And because XNNPACK sits cleanly beneath the frameworks, new Hexagon architectural features – shuffle improvements, new precision types, wider vector units – can be supported simply by updating the microkernels without altering the surrounding software stack.
We optimize once and the acceleration reaches many runtimes, which lets us enable seamless acceleration on Hexagon across the Android ecosystem.
What’s currently implemented for Hexagon NPU in XNNPACK
Here’s a high-level view of the Hexagon/HVX microkernels that are wired into production configs today:
| Data Type | Operator Families (HVX) |
| --- | --- |
| FP32 | GEMM, binary elementwise ops, unary & math, reductions, SoftMax, pack and transpose helpers |
| QS8 (Int8) | GEMM (int8 activations + per-channel int8 weights + int32 accumulators -> fp32 or int8 output), basic quantized add, quantize/dequantize helpers |
| QU8 (uint8) | GEMM (quantize/dequantize helpers for unsigned activation path) |
Table 1. List of currently implemented microkernels for Hexagon in XNNPACK. FP16 is not wired up yet.
The current set of microkernels already enables HVX offloading and achieves strong SIMD utilization, demonstrating the benefits of mapping compute-intensive operators to Hexagon.
These kernels are sufficient to accelerate many XNNPACK subgraphs in practice – MobileNet families, Attention, and Transformer Blocks among them.
At the same time, they are not yet fully tuned with VTCM-aware scheduling or aggressive optimizations; in other words, we are exploiting HVX vector width effectively today, but additional gains can be unlocked by using VTCM more strategically for prepacking data, weights and activations.
Enabling VTCM for microkernels is therefore a clear next step, providing another layer of optimization potential beyond SIMD.
Looking ahead, FP16 is not yet wired up for Hexagon in XNNPACK, but that is also a natural direction given the HVX FP16 capabilities and dot-product instructions. These Hexagon kernels represent a joint effort between Qualcomm Technologies and Google engineers.
Evaluations on benchmarks and subgraphs
Before we get into the results, these are the evaluation configurations we used:
- Device: Samsung Galaxy S24 (Snapdragon 8 Gen 3) with the CDSP clock scaled to 1.92 GHz.
- Baseline: Best available scalar implementation in XNNPACK (“before HVX”).
- Metric: Geometric mean speedup over the scalar baseline, plus min/max across input shapes.
Benchmarks:
- Sparse Matrix Multiplication FP32 (f32-spmm bench)
- QS8-QC8W-FP32 GEMM (int8 activations × per-channel int8 weights → fp32 outputs)
- Subgraph API models (MobileNet, Attention, TransformerBlock, etc.)
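The geometric-mean metric listed above can be computed as in the following sketch. The timing values used here are made-up placeholders for illustration, not measurements from these benchmarks, and `summarize` is a hypothetical helper name.

```python
import math
from typing import Dict, Sequence


def geomean(values: Sequence[float]) -> float:
    """Geometric mean via the mean of logarithms (values must be positive)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarize(scalar_ms: Sequence[float], hvx_ms: Sequence[float]) -> Dict[str, float]:
    """Per-shape speedup = scalar time / HVX time, then geomean, min and max."""
    speedups = [s / h for s, h in zip(scalar_ms, hvx_ms)]
    return {"geomean": geomean(speedups), "min": min(speedups), "max": max(speedups)}


# Placeholder timings for two input shapes (illustrative only).
summary = summarize(scalar_ms=[8.0, 2.0], hvx_ms=[1.0, 1.0])
print(summary)
```

The geometric mean is the usual choice for averaging ratios because a single very large speedup cannot dominate the summary the way it would with an arithmetic mean.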
Sparse Matrix Multiplication FP32
We used XNNPACK’s SpMM benchmark input sets that mimic real MobileNet layers:
- MobileNet V1 – regular 1×1 pointwise convolutions (GEMM-like)
- MobileNet V2 – inverted residual blocks → many tall-skinny GEMMs (large M, small K)
- MobileNet V3 Small – lots of tiny GEMMs (SE blocks, small channels)
- MobileNet V3 Large – wider GEMMs, more complex shapes (wide/fat matrices, imbalanced dims)
MobileNet V1 and V2 perform especially well because their depthwise-pointwise structure reduces to regular 1×1 GEMM-like layers with high arithmetic intensity. The matrix dimensions in these cases align cleanly with HVX tile sizes, and each kernel call does enough work to amortize packing and launch overhead. As a result, HVX lanes remain well utilized and we observe 7~8X speedups over the scalar baseline.
MobileNet V3 presents a very different set of challenges. The network includes many tiny and irregularly shaped GEMMs, which tend to have tiny M/N dimensions or imbalanced M/K/N ratios. In such cases, the microkernel spends more time on launch overhead, packing and tail handling, and HVX utilization is low. This leads to 2~3X speedups; in the worst cases, some layers may even fall below 1X.
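A rough way to see why tiny dimensions hurt is to compute the fraction of vector lanes doing useful work when an output dimension is processed in whole vectors. The sketch below assumes 32 FP32 lanes per 128-byte vector and ignores every other source of overhead; it is a back-of-the-envelope model, not a description of the actual microkernels.

```python
def lane_utilization(n: int, lanes: int = 32) -> float:
    """Useful lanes / total lanes when `n` outputs are computed in full vectors.

    `lanes=32` corresponds to FP32 elements in a 128-byte vector.
    """
    vectors = -(-n // lanes)  # ceiling division: vectors needed to cover n
    return n / (vectors * lanes)


for n in (8, 40, 64):
    print(n, lane_utilization(n))
```

Under this model a dimension of 8 keeps only a quarter of the lanes busy, while a dimension of 64 fills two vectors exactly.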
There are clear opportunities to improve performance for these irregular shapes. One possible direction is to introduce small-GEMM microkernels tuned specifically for narrow or uneven dimensions, so they better utilize HVX lanes. It may also be beneficial to avoid offloading layers whose compute-intensity falls below a threshold, leaving those to scalar execution while reserving HVX for the cases where it produces meaningful acceleration.
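The threshold idea could look something like the sketch below, which estimates GEMM arithmetic intensity as FLOPs per byte moved and only offloads when it clears a cutoff. The cost model (each operand touched exactly once) and the threshold value are assumptions chosen for illustration; a real heuristic would need to be tuned against measurements.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: int = 4) -> float:
    """FLOPs per byte for C[M,N] = A[M,K] @ B[K,N], each matrix touched once."""
    flops = 2 * m * n * k  # one multiply and one add per MAC
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def should_offload(m: int, n: int, k: int, threshold: float = 4.0) -> bool:
    """Hypothetical policy: offload only if intensity reaches the threshold."""
    return gemm_arithmetic_intensity(m, n, k) >= threshold


print(should_offload(256, 256, 256))  # large, regular GEMM
print(should_offload(1, 16, 16))      # tiny GEMM
```

With these assumptions a 256×256×256 GEMM has an intensity above 40 FLOPs per byte, while a 1×16×16 GEMM stays below 1 and would remain on the scalar path.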
QS8-QC8W-FP32 GEMM: Microkernel Upper Bound
Now let’s look at qs8-qc8w-fp32 GEMM. This benchmark has three main parts:
- qs8 – quantized signed int8 activations.
- qc8w – per-channel quantized int8 weights.
- fp32 – int8×int8 dot products accumulate in int32, then scale/dequantize to fp32 outputs.
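The three parts above can be summarized by a small NumPy reference. It assumes an asymmetric activation quantization with a single scale and zero point, symmetric per-output-channel weight scales, and no bias term; these are simplifications for illustration and do not reproduce the exact parameters of XNNPACK's kernels.

```python
import numpy as np


def qs8_qc8w_f32_gemm(a_q, a_scale, a_zero_point, w_q, w_scales):
    """Reference int8 x int8 GEMM with int32 accumulation and fp32 output.

    a_q:      (M, K) int8 activations
    w_q:      (K, N) int8 weights
    w_scales: (N,)   per-output-channel weight scales
    """
    # Widen to int32 before multiplying so products and sums cannot overflow int8.
    acc = (a_q.astype(np.int32) - a_zero_point) @ w_q.astype(np.int32)
    # Dequantize: one combined scale per output channel.
    return acc.astype(np.float32) * (np.float32(a_scale) * w_scales.astype(np.float32))


a_q = np.array([[1, -2], [3, 4]], dtype=np.int8)
w_q = np.array([[1, 2], [3, 4]], dtype=np.int8)
w_scales = np.array([0.1, 0.2], dtype=np.float32)
out = qs8_qc8w_f32_gemm(a_q, a_scale=0.5, a_zero_point=0, w_q=w_q, w_scales=w_scales)
print(out)
```

The inner product is pure integer arithmetic, and the only floating-point work is the final per-channel scaling of the int32 accumulators.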
This benchmark is closer to a pure microkernel stress test: no full graph, just GEMM.
We observed very large speedups in this benchmark, 50 ~ 90X. These numbers can seem almost unreal at first glance, but they make sense when viewed in context.
These measurements reflect the efficiency of large GEMMs with substantial data reuse, prepacked weight layouts, favorable memory locality, and the absence of graph-level overhead.
In other words, this is the “pure kernel” scenario, where HVX can be fully utilized thanks to ample compute work, hot caches, and high arithmetic intensity. Under those conditions, it is entirely possible for HVX to approach the theoretical SIMD peak performance relative to a scalar baseline.
Weaker microkernel results, such as those from LLM-like shapes, SRCNN, and MobileNet V3 Small, tell the other side of the story. These cases are dominated by smaller or irregular GEMMs where work per invocation is limited and bandwidth becomes the bottleneck.
More time is spent moving data than performing arithmetic operations, and packing plus tail handling can represent a larger fraction of total execution time.
It is worth emphasizing that these high microkernel speedups represent the upper limit of what HVX can deliver in isolation. Once we step up to full subgraph execution, where kernel launches, packing work, and memory traffic interact, end-to-end speedups will naturally be lower.
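One simple way to reason about this gap is Amdahl's law: if only a fraction of the end-to-end time is spent in the accelerated kernels, the rest caps the overall gain. The sketch below applies that formula; the 90% fraction is a made-up illustrative value, not a number measured in this work.

```python
def end_to_end_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the runtime is accelerated."""
    remaining = 1.0 - accelerated_fraction  # time that is not sped up
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


# Illustrative only: 90% of scalar time in GEMM, kernels 50x faster.
print(end_to_end_speedup(0.9, 50.0))
```

Under these assumed numbers a 50X kernel speedup yields roughly 8.5X end to end, because the unaccelerated 10% of the original runtime now dominates.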
Subgraph evaluation: real workloads, not just kernels
To understand “realistic” benefits, we used the XNNPACK subgraph API and benchmarked end-to-end models:
Models:
- Attention – FP32 and QD8 (Dynamic 8-bit Quantization, FP32 activations are quantized to int8 at runtime)
- MobileNet V1/V2/V3 (small/large) – FP32
- FP32 Elementwise (chains of ops)
- SoftmaxDecomp / SoftmaxFused
  - SoftmaxDecomp: ReduceMax → Sub → Exp → ReduceSum → Divide
  - SoftmaxFused: Single fused softmax op (optimized kernel)
- DepthwiseSeparable FP32: DepthwiseConv + PointwiseConv (MobileNet blocks)
- TransformerBlock FP32 and QD8
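For readers unfamiliar with the QD8 variants in the list above, the sketch below shows one common form of dynamic quantization: each row of FP32 activations gets its own scale and zero point, computed at runtime from that row's value range. The exact formula and the function names are assumptions for illustration and may differ from what XNNPACK implements.

```python
import numpy as np


def dynamic_quantize_rows(x: np.ndarray):
    """Per-row asymmetric quantization of FP32 activations to int8."""
    # Include 0.0 in the range so that zero is representable.
    x_min = np.minimum(x.min(axis=1, keepdims=True), 0.0)
    x_max = np.maximum(x.max(axis=1, keepdims=True), 0.0)
    scale = (x_max - x_min) / 255.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    zero_point = np.round(-128 - x_min / scale)
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale.astype(np.float32), zero_point.astype(np.int32)


def dequantize_rows(q: np.ndarray, scale: np.ndarray, zero_point: np.ndarray) -> np.ndarray:
    """Map int8 values back to approximate FP32 values."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
q, scale, zero_point = dynamic_quantize_rows(x)
x_hat = dequantize_rows(q, scale, zero_point)
```

Because the range scan and the quantization pass run on every inference, this step adds work in front of each quantized GEMM.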
We report geometric mean speedups over the best scalar implementation.
Across the full range of subgraph benchmarks, the strongest gains come from workloads that are dominated by compute-heavy and regularly structured kernels.
MobileNet V1, V2 and V3 Large show end-to-end speedups of around 8~9X, and the DepthwiseSeparable subgraph achieves roughly 10X geomean and can hit 20X in the best cases. The Attention subgraphs, both FP32 and QD8, land in the 5~6X range. SoftmaxFused shows a similar geomean speedup of roughly 5.3X and can reach as high as 22X, which clearly highlights the advantage of using a fused operator rather than a decomposed version.
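The difference between the decomposed and fused softmax forms can be sketched in NumPy. The decomposed version runs five whole-tensor operators, each producing an intermediate tensor, while the fused version handles each row in a single call. This is an illustration of the structure only; it says nothing about how the HVX kernels are actually written.

```python
import numpy as np


def softmax_decomposed(x: np.ndarray) -> np.ndarray:
    """Five separate operators, each materializing a full intermediate."""
    m = np.max(x, axis=-1, keepdims=True)   # ReduceMax
    shifted = x - m                         # Sub
    e = np.exp(shifted)                     # Exp
    s = np.sum(e, axis=-1, keepdims=True)   # ReduceSum
    return e / s                            # Divide


def softmax_fused(x: np.ndarray) -> np.ndarray:
    """One operator: each row is reduced, exponentiated and normalized in place."""
    out = np.empty_like(x)
    for i, row in enumerate(x):
        row_max = row.max()
        np.exp(row - row_max, out=out[i])  # write exp results straight to the output
        out[i] /= out[i].sum()             # normalize in place
    return out


x = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]], dtype=np.float32)
fused = softmax_fused(x)
decomposed = softmax_decomposed(x)
```

Both functions return the same values; the fused form simply avoids separate operator launches and whole-tensor intermediates between the steps.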
These subgraphs are the bread-and-butter targets of HVX: GEMM-heavy, convolution-heavy and SIMD-friendly, which explains why they show such consistently strong results.
The weaker cases provide insight into where HVX utilization breaks down. MobileNet V3 Small consistently shows signs of HVX underutilization across all evaluations.
These kernels are too small to amortize packing, launch and tail handling overheads. As noted in the microkernel results above, specialized small-GEMM microkernels could help this case substantially. We may also want to introduce “dynamic offload” heuristics to keep such small layers on scalar execution.
Another weaker case is FP32 elementwise chains. These operations have very low FLOPs per byte; therefore, SIMD execution offers modest benefit while kernel launch overhead becomes nontrivial.
One promising direction here is to fuse multiple elementwise operations into a single kernel.
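A simple traffic model shows why fusion helps for such chains. The sketch assumes unary elementwise operators, that every unfused operator reads its input from memory and writes its output back, and that a fused kernel reads the input once and writes the final result once. The function name and the cost model are illustrative assumptions.

```python
def elementwise_chain_traffic(n_elements: int, n_ops: int,
                              bytes_per_element: int = 4,
                              fused: bool = False) -> int:
    """Bytes read plus written for a chain of unary elementwise ops."""
    passes = 1 if fused else n_ops  # a fused kernel makes a single pass
    return passes * 2 * n_elements * bytes_per_element  # one read + one write per pass


unfused_bytes = elementwise_chain_traffic(1024, n_ops=3)
fused_bytes = elementwise_chain_traffic(1024, n_ops=3, fused=True)
print(unfused_bytes, fused_bytes, unfused_bytes / fused_bytes)
```

In this model memory traffic shrinks by a factor equal to the chain length, and the number of kernel launches drops from one per operator to one in total.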
The QD8 TransformerBlock case illustrates a different bottleneck profile. The GEMM kernels themselves perform well, but weight and activation packing, combined with quantization/dequantization overhead, effectively cancels the SIMD gains.
That is why the QD8 TransformerBlock achieves only about 1.1 ~ 3.5X instead of the 5 ~ 6X we might expect. Possible improvements include prepacking and caching weights on the DSP side and implementing better packing kernels tuned for HVX.
Finally, the FP32 TransformerBlock sits somewhere between these extremes. The GEMM portion already benefits significantly from HVX, but the SoftMax and surrounding steps remain partially unfused and in some cases still fall back to scalar execution.
The clear path forward is to make sure SoftmaxFused is used where applicable and to go further by fusing more of the attention sub-block into a single HVX-friendly kernel. Longer term, we may want to introduce specialized attention kernels to reduce launch overhead even further.
What’s next
Looking at the current state of HVX integration within XNNPACK, we already see substantial gains on realistic workloads. Across MobileNet, DepthwiseSeparable, and Attention-style subgraphs, the end-to-end speedups consistently fall in the 5 ~ 10X range.
In microkernel isolation, the performance ceiling is even higher: the largest GEMM-centric benchmarks can reach 50 ~ 90X over scalar execution, making it clear that HVX is very well aligned with the arithmetic patterns found in modern CNN inference.
At the same time, the results shed light on where efficiency tapers off. Small or irregularly shaped kernels, such as those found in MobileNet V3 Small and SE blocks, remain dominated by overhead and consequently degrade performance.
Quantized execution paths reveal the cost of packing and unpacking activations and weights. Depthwise convolution on quantized paths still has room to improve, and although HVX supports FP16 well, the FP16 kernels have not been wired up in XNNPACK yet.
The future direction is therefore well motivated and technically clear.
Additional kernel coverage – particularly FP16 GEMM and convolution using HVX’s FP16 dot product instructions – is a natural next step. Improved packing kernels for both FP32 and INT8 (including qc4w layouts) will reduce front-end overhead and increase effective throughput.
More aggressive use of VTCM for hot data residency should reduce stall cycles, while fine-grained prefetching, register allocation and VLIW-aware instruction scheduling, guided by profiling, will help extract even more parallelism from the architecture.
The best part of this work is how broadly it propagates. Once these optimized kernels land in XNNPACK, they do not exist in isolation – they immediately benefit any supported runtime that uses XNNPACK as a backend.
TensorFlow Lite, ONNX Runtime, ExecuTorch, and internal runtimes pick up the improvements without vendor-specific patches.
The optimization happens once, but its benefits extend everywhere.
This work is evolving and will continue to expand, and community feedback is welcome. Join Developer Discord to share your feedback with the team!
Learn more
1. XNNPACK GitHub Repository: Explore the XNNPACK repo for more details and benchmarks, and to follow ongoing development.
2. Qualcomm Hexagon HVX Programmer’s Manual: Architecture guides, HVX instruction reference, VTCM usage, and programming best practices.
3. Qualcomm Developer: Tutorials, app notes, and AI optimization guides for development on Hexagon.

