Smart routing with Snapdragon X Series NPUs: A developer’s perspective
Co-written with Dileep Karpur.
For over 20 years, Teknikos has empowered forward-thinking companies, from Fortune 100 leaders to mid-size innovators, by leveraging emerging technologies to solve real business problems. Specializing in touch computing and AI, we develop software that transforms customer and employee engagement through mobile applications, enterprise solutions, and immersive experiences.
In May 2025, I participated in a Microsoft-led event in Mexico City focused on hybrid AI—the orchestration of on-device and cloud-based AI. There, I designed and led a hands-on lab that demonstrated how developers can use the Snapdragon X Series NPUs built into Microsoft Surface devices to accelerate real-world AI scenarios.
This post walks through that use case so you can start running your own hybrid AI applications on Snapdragon X Series NPUs.
Why Snapdragon X Series NPUs matter for Windows developers
The Snapdragon X Series processors are transforming Windows PC AI development with dedicated Qualcomm Hexagon Neural Processing Units (NPUs) delivering up to 45 TOPS of AI performance. With these NPUs, developers can build responsive, privacy-conscious applications that seamlessly blend cloud and edge AI—without compromising on performance or user experience.
The Snapdragon X Elite and Snapdragon X Plus processors feature purpose-built NPUs that offer big advantages over traditional CPU/GPU AI execution:
NPU Performance and Efficiency: Leader in TOPS per Watt
- 45 TOPS NPU for accelerated AI neural network performance
- Over 35x lower latency compared to CPU inference (watch Moises Live)
- Over 100x reduction in power consumption vs. CPU (watch the Qualcomm Technologies keynote at Computex)
- Concurrent AI workload processing without impacting GPU and CPU performance
Developer Integration:
- Native Windows Machine Learning (Windows ML) framework support
- ONNX Runtime integration with the QNN Execution Provider
- Seamless Visual Studio development workflow
- WinUI3 and Python compatibility
Use Case: Hybrid AI chatbot with dynamic routing
Our goal was to demonstrate how the same ONNX model can be deployed across cloud and edge environments, while tapping into the powerful onboard NPU via Qualcomm Technologies’ QNN Execution Provider (QNN EP) to deliver low-latency, privacy-conscious, resilient AI experiences.
We built a demo chatbot that acts like an internal helpdesk assistant. The chatbot first classifies user messages by intent (i.e., which department this message is for) using a zero-shot classification model, and then generates a relevant response using a language model prompted to speak in the voice of someone who works in that department.
Both the classification task and the language generation were designed to run either in the cloud or on device, with no change to the core application logic.
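One way to keep the core logic unchanged is to code against small interfaces and inject whichever backend is selected at runtime. The Python sketch below is illustrative only: `Classifier`, `Generator`, `handle_message`, and the stub backends are hypothetical names that do not come from the lab's source, and the stubs return canned values instead of calling a real model.

```python
from typing import Protocol


class Classifier(Protocol):
    def classify(self, text: str) -> str: ...


class Generator(Protocol):
    def generate(self, department: str, text: str) -> str: ...


class StubClassifier:
    """Stand-in for either backend (on-device ONNX session or cloud endpoint)."""

    def classify(self, text: str) -> str:
        # Canned rule instead of a real zero-shot model (assumption for the demo).
        return "technical support" if "password" in text.lower() else "sales"


class StubGenerator:
    """Stand-in for a language model; `backend` only labels the output."""

    def __init__(self, backend: str) -> None:
        self.backend = backend

    def generate(self, department: str, text: str) -> str:
        return f"[{self.backend}] {department} team here, regarding: {text}"


def handle_message(text: str, classifier: Classifier, generator: Generator) -> str:
    """Core logic: classify the intent, then answer in that department's voice."""
    department = classifier.classify(text)
    return generator.generate(department, text)


print(handle_message("I forgot my password", StubClassifier(), StubGenerator("local")))
```

Because `handle_message` depends only on the two interfaces, swapping a cloud backend for an on-device one is a change at the call site, not in the logic itself.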
For classification, we used a fine-tuned DeBERTa model (via Hugging Face) optimized for ONNX. For language generation, the app dynamically switched between GPT-4o-mini in Azure and Phi Silica, a compact, on-device language model built specifically to run on an NPU.
Crucially, the intent classification model was accelerated on device using the QNN EP, giving us access to the Snapdragon X Series NPU for inference. This gave the chatbot sub-second classification performance even in offline or limited-connectivity environments.
From Hugging Face to NPU acceleration
The lab walked participants through:
- Converting Hugging Face transformer models to ONNX using Python-based tools
- Running ONNX models on device using QNN EP to utilize the NPU
- Deploying the same model to Azure Machine Learning for cloud comparison
The Phi Silica language model runs only on the NPU. Getting the ONNX Runtime to use the NPU from .NET was also simple, requiring only the following lines of code (from LocalClassifierService.cs in the downloadable source files):
var sessionOptions = new SessionOptions();
Dictionary<string, string> options = new()
{
    { "backend_path", "QnnHtp.dll" }
};
// Pass the provider options so the QNN EP loads the HTP (NPU) backend
sessionOptions.AppendExecutionProvider("QNN", options);
_inferenceSession = new InferenceSession(modelFile, sessionOptions);
Note that Phi Silica has some additional requirements, which you can find in its documentation. We needed no additional help from Qualcomm Technologies. Keep in mind, though, that the ONNX Runtime may silently fall back to CPU execution if the supplied model is not compatible with the NPU, so you may need to adjust your model further using optimization and quantization techniques.
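To make a silent fallback visible, one option is to ask ONNX Runtime to fail instead of falling back, and to check which execution providers the session actually enabled. The sketch below is a minimal illustration, not code from the lab: `missing_providers` and `create_strict_qnn_session` are hypothetical helper names, and the second function assumes the onnxruntime-qnn package on a Windows on Snapdragon device. The provider check only shows whether the QNN EP was enabled at all; it does not prove that every node in the model ran on the NPU.

```python
def missing_providers(requested, active):
    """Return the requested execution providers that the session did not enable."""
    active_set = set(active)
    return [name for name in requested if name not in active_set]


def create_strict_qnn_session(model_path):
    """Create a session that raises instead of falling back to the CPU.

    Assumption: onnxruntime with the QNN EP is installed on the device.
    """
    import onnxruntime  # imported lazily so the helper above works anywhere

    options = onnxruntime.SessionOptions()
    # Session config key described in the QNN EP documentation;
    # "1" asks ONNX Runtime not to assign nodes to the CPU EP.
    options.add_session_config_entry("session.disable_cpu_ep_fallback", "1")
    session = onnxruntime.InferenceSession(
        model_path,
        sess_options=options,
        providers=["QNNExecutionProvider"],
        provider_options=[{"backend_path": "QnnHtp.dll"}],
    )
    missing = missing_providers(["QNNExecutionProvider"], session.get_providers())
    if missing:
        raise RuntimeError(f"Execution providers not enabled: {missing}")
    return session


# The pure helper can be exercised without any hardware:
print(missing_providers(["QNNExecutionProvider"], ["CPUExecutionProvider"]))
```

If session creation fails under these settings, that is a hint the model needs further optimization or quantization before it can run on the NPU.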
Here’s a sample of the Python inference code (from Lab.ipynb in the _GOODIES INSIDE folder in the downloadable source code) using the QNN EP to run locally:
import onnxruntime
import numpy as np
import torch
import gc
from transformers import AutoTokenizer
# Parameters
text = "How many calories are in your hamburger?"
candidate_labels = ["sales", "customer support", "technical support", "accounting", "marketing", "shipping and orders"]
model_path = r"..\AImodels\zero_shot_classifier.onnx"
tokenizer_name = "MoritzLaurer/deberta-v3-large-zeroshot-v1"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
# Use the QNN (aka Qualcomm AI Engine Direct) execution provider
options = onnxruntime.SessionOptions()
session = onnxruntime.InferenceSession(model_path, sess_options=options, providers=["QNNExecutionProvider"], provider_options=[{"backend_path":"QnnHtp.dll"}])
# Create hypotheses
template = "This example is {}."
hypotheses = [template.format(label) for label in candidate_labels]
# Tokenize
encoded = tokenizer(
    [text] * len(candidate_labels),
    hypotheses,
    return_tensors="np",
    truncation=True,
    padding=True
)
# Run ONNX model
logits = session.run(
    None,
    {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],
    }
)[0]
del session
# Use entailment scores (index 0) and normalize with softmax
entailment_scores = logits[:, 0]
probs = torch.softmax(torch.from_numpy(entailment_scores), dim=0).numpy()
# Zip labels and scores
results = sorted(zip(candidate_labels, probs), key=lambda x: x[1], reverse=True)
# Display
for label, score in results:
    print(f"{label}: {score:.4f}")
gc.collect()
Smart AI routing with evaluators
To round out the lab, we introduced evaluators—logic that dynamically chooses whether to use cloud or on-device AI based on runtime conditions like connectivity, privacy, or device capability.
In our Windows desktop app (built with WinUI3), we used two evaluators:
- Connectivity: Fall back to on-device inference if the cloud endpoint is unavailable or slow
- Privacy: Keep data local if sensitive patterns (e.g., account numbers) are detected
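The two evaluators can be sketched as small predicate functions combined by a router. This is a hedged Python illustration rather than the app's actual C# code: the function names, the account-number regex, and the one-second latency threshold are all assumptions, and the `probe` callable stands in for a real timed request to the cloud endpoint.

```python
import re
from typing import Callable

# Hypothetical pattern: treat any run of 8 to 16 digits as an account number.
ACCOUNT_NUMBER = re.compile(r"\b\d{8,16}\b")


def privacy_evaluator(message: str) -> bool:
    """Return True when the message should stay on the device."""
    return bool(ACCOUNT_NUMBER.search(message))


def connectivity_evaluator(probe: Callable[[], float], max_latency_s: float = 1.0) -> bool:
    """Return True when the cloud should be avoided.

    `probe` returns the measured round-trip time in seconds, or raises
    OSError (for example ConnectionError) when the endpoint is unreachable.
    """
    try:
        latency = probe()
    except OSError:
        return True
    return latency > max_latency_s


def choose_route(message: str, probe: Callable[[], float]) -> str:
    """Return "local" or "cloud"; privacy is checked before connectivity."""
    if privacy_evaluator(message):
        return "local"
    if connectivity_evaluator(probe):
        return "local"
    return "cloud"


print(choose_route("My account is 1234567890", lambda: 0.1))
print(choose_route("What are your hours?", lambda: 0.1))
```

Checking privacy first means sensitive text is never sent to the probe or the cloud, even when the connection is healthy.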
The C# classification code (from LocalClassifierService.cs in the downloadable source files) mirrors the Python inference code from Lab.ipynb, also using the QNN EP for the inference session to ensure hardware-accelerated AI execution on device.
var scores = new List<float>();
foreach (var label in labels)
{
    // Tokenize the (input, label) pair and wrap it as 1 x N tensors
    var (inputIds, attentionMask) = Tokenize(input, label);
    var inputs = new List<NamedOnnxValue>
    {
        NamedOnnxValue.CreateFromTensor("input_ids", new DenseTensor<long>(inputIds, new[] { 1, inputIds.Length })),
        NamedOnnxValue.CreateFromTensor("attention_mask", new DenseTensor<long>(attentionMask, new[] { 1, attentionMask.Length }))
    };
    using IDisposableReadOnlyCollection<DisposableNamedOnnxValue> results = _inferenceSession.Run(inputs);
    var logits = results.First().AsEnumerable<float>().ToArray();
    // Keep the entailment logit (index 0) for this label
    scores.Add(logits[0]);
}
// Softmax (scores --> relative probabilities) across all entailment scores
float max = scores.Max(); // for numerical stability
float[] exp = scores.Select(s => MathF.Exp(s - max)).ToArray();
float sum = exp.Sum();
float[] probs = exp.Select(e => e / sum).ToArray();
// Pair each label with its probability, sorted from most to least likely
var resultsWithLabels = labels.Zip(probs, (label, prob) => (label, prob))
    .OrderByDescending(x => x.prob)
    .ToList();
return resultsWithLabels;
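The softmax step subtracts the maximum score before exponentiating, which leaves the result unchanged mathematically but keeps the exponentials from overflowing. A small Python sketch of the same idea, written for illustration and not taken from the lab files:

```python
import math


def softmax(scores):
    """Numerically stable softmax: shift by the max so every exponent is <= 0."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# math.exp(1000.0) alone would raise OverflowError; the shifted version is fine.
probs = softmax([1000.0, 999.0])
print(probs[0] > probs[1], abs(sum(probs) - 1.0) < 1e-9)
```

Equal scores produce equal probabilities, and the largest score always receives the largest probability, so the ranking of labels is preserved.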
Results and performance benefits
We closed the session with a live demo, showing how the app dynamically switched between cloud and edge execution based on real-world scenarios—powered by the Snapdragon X Series NPU. The feedback was clear: hybrid AI is here, and on-device acceleration enables responsive, private, cost-effective AI applications.
Comparing benchmarks of the ONNX model accelerated on the NPU with the same ONNX model hosted in the cloud, participants saw first-hand how on-device NPU execution offers large latency and efficiency benefits without compromising accuracy. The following timings from the application log illustrate the benefits.
Running in the cloud, the evaluator starts at :40.704 and generates an Azure OpenAI response at :42.672, approximately 2 seconds of execution time. Running on the local device, the evaluator starts at :53.486 and generates a Phi Silica response at :53.680, slightly less than 0.2 seconds of execution time.
In other words, the chatbot runs about ten times faster on the device than in the cloud.
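The arithmetic behind that figure can be checked directly. This short Python sketch assumes the log timestamps are seconds and milliseconds within the same minute; `elapsed` is a hypothetical helper name.

```python
def elapsed(start_s, end_s):
    """Elapsed time in seconds, rounded to the log's millisecond precision."""
    return round(end_s - start_s, 3)


cloud = elapsed(40.704, 42.672)   # evaluator start -> Azure OpenAI response
local = elapsed(53.486, 53.680)   # evaluator start -> Phi Silica response
speedup = cloud / local

print(f"cloud={cloud:.3f}s local={local:.3f}s speedup={speedup:.1f}x")
# prints: cloud=1.968s local=0.194s speedup=10.1x
```

This is a single pair of log entries, so it illustrates the difference seen in this run and is not a general benchmark.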
Your turn
The full source code, Jupyter notebook, and slides are available at https://github.com/jonathankhootek/HelpChatLab_public. Make sure to check the README.md for setup instructions.
Running the full application requires an Azure subscription with an LLM deployment and an Azure Machine Learning workspace hosting the ONNX model, but you can get started right away by running the Lab.ipynb notebook. It walks you through generating and executing the ONNX model on the Snapdragon X Series NPU—no cloud setup required. You can even open Task Manager to watch the NPU in action.
Try it yourself and see how easy it is to bring real AI workloads to the edge with NPU acceleration.
Whether you're building privacy-first applications, making offline-capable tools, or simply trying to lower inference costs, now’s the time to start exploring hybrid AI.
The future of Windows AI development
Snapdragon X Series processors represent a paradigm shift for Windows AI development. By bringing datacenter-class AI performance directly to the PC, developers can build applications that are simultaneously more responsive, more private, and more cost-effective than traditional cloud-only approaches.
The combination of powerful NPU hardware, mature development tools, and seamless Windows integration creates unprecedented opportunities for AI-powered Windows applications. Whether you're building productivity tools, creative applications, or enterprise solutions, the Snapdragon X Series NPU provides the performance foundation for next-generation AI experiences.
Are you already experimenting with Qualcomm tools? We’d love to see what you’re working on. Join the Qualcomm Developer Discord to share your progress, ask questions, and connect with other developers pushing the boundaries.

