Accelerate your AI apps: Windows ML on Snapdragon X Elite devices
Key Feature Enhancements of Windows ML for General Availability
AI is transforming the PC experience, and workloads like Agentic AI and large language models (LLMs) are pushing the limits of compute and memory performance. Earlier this year at Microsoft Build, Qualcomm Technologies and Microsoft previewed Windows ML to bring advanced AI capabilities to the PC.
Today, we’re excited to take the next big step: announcing the general availability (GA) of Windows ML. This milestone reflects our deep collaboration with Microsoft to deliver cutting-edge AI experiences right on your device. With these advancements, developers and users can now take advantage of a faster, more efficient, and up-to-date AI stack on next-generation PCs.
“With Windows ML now in general availability, this is a pivotal moment for AI developers on Windows. The new Windows ML runtime not only delivers cutting-edge on-device inference but also simplifies deployment, enabling developers to fully harness advanced AI processors on Snapdragon X Series platforms.
Its unified framework and support for NPUs, GPUs, and CPUs enable exceptional performance and efficiency across Windows PCs with Snapdragon. As agentic AI experiences become mainstream, our deep collaboration with Microsoft is accelerating innovation and bringing the best AI experiences to Windows Copilot+ PCs and soon to our next-generation Snapdragon X2 Elite platforms.”
- Upendra Kulkarni, VP, Product Management, Qualcomm Technologies, Inc.
“Microsoft and Qualcomm are bringing advanced AI to Windows 11 PCs with Windows ML optimized for Snapdragon X Series platforms including the recently announced Snapdragon X2 platform. With Windows ML now generally available, developers can deliver faster, smarter AI experiences unlocking new possibilities for developers and the ecosystem.”
- Logan Iyer, VP, Distinguished Engineer, Windows Platform & Developer
Windows ML is a unified and high-performing local inference framework powered by ONNX Runtime Engine (ORT) with to-the-metal optimizations covering the breadth of the IHV ecosystem. Windows ML automatically binds apps to the best available silicon on client devices.
Key features of Windows ML for silicon management are:
- Handling the distribution of IHV dependencies such as compilers, runtimes and drivers across NPU, GPU and CPU.
- Dynamic binding of apps to available silicon: Apps no longer need to explicitly enumerate and bind to NPU/GPU/CPU providers. However, apps retain the flexibility to explicitly specify a preferred execution target (e.g., execute on NPU only) if needed.
Alternatively, apps can leverage smart device selection policies such as high performance, high efficiency or minimum power using the LearningModelDeviceKind API.
Taking advantage of Windows ML APIs on Snapdragon NPU for app acceleration
With just a few lines of code, Windows ML enables developers to quickly develop applications leveraging model inferencing on Snapdragon NPU and other Qualcomm silicon. This section highlights the core building blocks of Windows ML app development. Refer to Windows.AI.MachineLearning for detailed information on Windows ML APIs.
Windows ML APIs for infrastructure management
- Initialize Windows AI infrastructure and download the required IHV-specific Execution Provider (QNN EP for Qualcomm) binaries onto the system:
    winrt::Microsoft::Windows::AI::MachineLearning::Infrastructure infrastructure;
- Download the EPs onto the machine:
    infrastructure.DownloadPackagesAsync().get();
- Register installed EPs with ORT:
    auto executionProviders = infrastructure.LoadExecutionProvidersAsync().get();
Windows ML APIs for session creation and inference
- Dynamic system selection of inference device:
    LearningModelDeviceKind deviceKind = LearningModelDeviceKind::Default;
- Directed device selection using the LearningModelDeviceKind enum
- Set session options:
    LearningModelSessionOptions sessionOptions;
- Create session:
    LearningModelSession session = LearningModelSession{…};
- Bind input and output resources:
    LearningModelBinding binding = LearningModelBinding{session};
- Evaluate model:
    LearningModelEvaluationResult results = session.EvaluateAsync(…).get();
ONNX Runtime APIs for session creation and inference
- Set session options:
    Ort::SessionOptions session_options;
- Allocate input and output tensors:
    session.GetInputCount()
    session.GetOutputCount()
- Compile model:
    Ort::Status compileStatus = Ort::CompileModel(env, compile_options);
- Infer model:
    session.Run()
Windows ML packaging and distribution
Windows ML strikes a balance between empowering IHVs to innovate and advance quickly with their implementations, while keeping serviceability and compatibility at the forefront. The Windows ML Runtime and IHV Execution Providers are packaged and distributed separately.
- The Windows ML Infrastructure class provides methods to download, configure, and register AI execution providers (EPs) for use with the ONNX Runtime.
- IHV Execution providers (EPs) are packaged as separate MSIX components, for example, Qualcomm QNN EP. This design allows execution providers to be updated independently from the Windows ML Runtime components.
Windows ML general availability extensions to Olive for improved LLM Accuracy and Performance
Olive is a high-level, IHV-agnostic framework used to download models from Hugging Face, quantize them, optimize them for specific hardware, evaluate them, check accuracy, and measure latency. The ONNX model generated by Olive is optimized for execution on Qualcomm NPU using Windows ML with QNN EP.
Olive provides both a CLI and a Python interface, with a JSON config file as input, to perform all the necessary graph transformations on the input ONNX model. For example: olive run --config resnet_ptq_qnn_qdq.json
At general availability, Windows ML support for Snapdragon NPU comes with accuracy and performance improvements implemented in Olive and ortGenAI for popular LLMs.
With enhancements to the prompt window processing, KV cache population and token generation stages of the LLM pipeline in ONNX Runtime (ORT) v1.23, models like Phi 3.5 provide higher accuracy.
Quantization techniques play a key role in accuracy. GPTQModel quantization and AI Model Efficiency Toolkit (AIMET) quantization have been enabled in Microsoft’s Olive toolchain to improve the accuracy of LLMs on Snapdragon NPU.
- GPTQModel is a production-ready LLM compression and quantization toolkit with hardware-accelerated inference support for both CPU and GPU via Hugging Face Transformers, vLLM, and SGLang. GPTQ offers a blend of quality and inference speed suited to real-world production deployment.
- AIMET LPBQ quantization: AIMET is a software toolkit for quantizing trained ML models. It improves the runtime performance of deep learning models by reducing compute load and memory footprint, which makes quantized models easier to deploy on edge devices such as mobile phones and laptops.
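To make the memory-footprint argument concrete, the following is a minimal standard-library C++ sketch of symmetric 4-bit round-to-nearest quantization with one scale per block of weights. It is an illustration only: it is not the GPTQ algorithm and not AIMET LPBQ, it does not use Olive, and the names `QuantizedBlocks`, `quantize_int4`, and `dequantize` are hypothetical. Setting `block_size` to the channel length gives one scale per channel; a smaller value gives block-wise scales.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one signed 4-bit code per weight (held in int8_t
// here for readability) plus one float scale per block of weights.
struct QuantizedBlocks {
    std::vector<std::int8_t> codes;  // each value lies in [-8, 7]
    std::vector<float> scales;       // scales[b] belongs to block b
    std::size_t block_size;
};

// Symmetric round-to-nearest quantization to 4 bits. Within each block the
// largest magnitude is mapped onto the code 7, so zero stays exactly zero.
inline QuantizedBlocks quantize_int4(const std::vector<float>& weights,
                                     std::size_t block_size) {
    assert(block_size > 0);
    QuantizedBlocks q;
    q.block_size = block_size;
    q.codes.reserve(weights.size());
    for (std::size_t start = 0; start < weights.size(); start += block_size) {
        const std::size_t end = std::min(weights.size(), start + block_size);
        float max_abs = 0.0f;
        for (std::size_t i = start; i < end; ++i) {
            max_abs = std::max(max_abs, std::fabs(weights[i]));
        }
        // An all-zero block gets scale 1 so the division below is safe.
        const float scale = (max_abs > 0.0f) ? max_abs / 7.0f : 1.0f;
        q.scales.push_back(scale);
        for (std::size_t i = start; i < end; ++i) {
            long code = std::lround(weights[i] / scale);
            code = std::max(-8L, std::min(7L, code));  // keep within 4 bits
            q.codes.push_back(static_cast<std::int8_t>(code));
        }
    }
    return q;
}

// Recover approximate float weights from the codes and per-block scales.
inline std::vector<float> dequantize(const QuantizedBlocks& q) {
    std::vector<float> out;
    out.reserve(q.codes.size());
    for (std::size_t i = 0; i < q.codes.size(); ++i) {
        out.push_back(static_cast<float>(q.codes[i]) * q.scales[i / q.block_size]);
    }
    return out;
}
```

Each weight shrinks from 32 bits to 4 bits, plus one shared scale per block. Smaller blocks cost more scales but keep a single outlier from inflating the scale used for every other weight in the channel.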
Performance improvement: By fine-tuning the inter-op and intra-op thread management options in Olive, the performance of several key LLMs has been brought to within a close margin of native performance.
- Intra-op and inter-op thread optimization: These parameters control the number of threads used to parallelize work within a single operation (intra-op) and to run independent operations in parallel (inter-op).
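The difference between the two thread settings can be sketched in plain standard-library C++. This is an illustration under stated assumptions, not how Olive or ONNX Runtime schedule work internally: `add_intra_op` and `run_inter_op` are hypothetical helper names, and an element-wise add stands in for a real operator. In the ONNX Runtime C++ API, the corresponding settings are `Ort::SessionOptions::SetIntraOpNumThreads` and `Ort::SessionOptions::SetInterOpNumThreads`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Intra-op parallelism: ONE operation (an element-wise add here) is split into
// contiguous chunks, and each chunk is processed by its own thread.
inline std::vector<float> add_intra_op(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t num_threads) {
    assert(a.size() == b.size());
    assert(num_threads > 0);
    std::vector<float> out(a.size());
    const std::size_t chunk = (a.size() + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(a.size(), t * chunk);
        const std::size_t end = std::min(a.size(), begin + chunk);
        if (begin == end) break;  // no work left for further threads
        // Each thread writes a disjoint range of `out`, so no locking is needed.
        workers.emplace_back([&a, &b, &out, begin, end] {
            for (std::size_t i = begin; i < end; ++i) out[i] = a[i] + b[i];
        });
    }
    for (std::thread& w : workers) w.join();
    return out;
}

// Inter-op parallelism: SEVERAL independent operations run at the same time,
// one thread per operation. The caller must ensure they do not depend on
// each other's results.
inline void run_inter_op(const std::vector<std::function<void()>>& independent_ops) {
    std::vector<std::thread> workers;
    for (const std::function<void()>& op : independent_ops) workers.emplace_back(op);
    for (std::thread& w : workers) w.join();
}
```

Raising the intra-op count helps when individual operators are large, while inter-op threads only help when the graph contains branches that can run independently. That is why the two values are usually tuned together for a given model.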
The table below lists the Olive recipes for the non-GenAI and GenAI models that are included in the Windows ML Certification Requirements. Developers can use these recipes to deploy the models optimized for Snapdragon NPUs.
| Model Name | Hugging Face Link | Qualcomm Olive Recipe (NPU) |
|---|---|---|
| Model Card for CLIP ViT-B/32 - LAION-2B | | |
| OpenAI Model Card: CLIP 16x16 | | |
| OpenAI Model Card: CLIP 32x32 | | |
| Vision Transformer (base-sized model) | | |
| BERT multilingual base model (cased) | | |
| INT8 BERT base uncased finetuned MRPC | | |
| DeepSeek R1 Distill | https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | |
| Phi 3.5 mini instruct | | |
| Llama-3.2-1B-Instruct | | |
| Qwen 2.5 1.5B | | |
Olive high-level developer workflow
1. Convert a HuggingFace model to ONNX
2. Convert dynamic shape inputs to fixed dimension
3. Quantize the model with OnnxStaticQuantizer (OnnxStaticQuantization).
4. Generate ONNX model with embedded EP context (EPContextBinaryGenerator)
5. Split a bigger model into smaller parts (CaptureSplitInfo/SplitModel)
6. Apply the graph transformations available in Olive graph surgeries (GraphSurgeries)
A combination of the following quantization techniques and graph surgeries was used to optimize performance on Snapdragon:
1. MatmulAddFusion: Fuses MatMul and Add ops into a Gemm operator, which is optimized for QNN
2. ReplaceAttentionMask: Clips large negative attention mask values to -1e4 for quantization efficiency
3. RemoveRopeMultiCache: Eliminates redundant RoPE computations across the KV cache
4. AttentionMaskToSequenceLengths: Controls the context length at runtime
5. SimplifiedLayerNormToL2Norm: Converts the basic layer norm op to a more efficient L2 norm
6. Weight Rotation: Reduces outliers in weights and hidden states to enhance quantization efficiency
7. GPTQ: 4-bit per-channel symmetric quantization that reduces transformer layer size while preserving accuracy
8. ONNX Graph Capture: Exports the model to ONNX for further optimizations
9. 4-bit block-wise quantization applied to the embedding layer and LM head
10. AIMET quantization: Added to Olive to help achieve native scaling
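Two of the graph surgeries in the list above rest on simple arithmetic, sketched here in standard-library C++. This is not Olive's implementation: the function names are hypothetical, the vectors stand in for ONNX tensors, and the layer norm omits the usual epsilon term so that the equivalence is exact. The first function clips an additive attention mask to the -1e4 bound mentioned above. The other two show that a simplified layer norm (RMS norm) gives the same result as an L2 normalization whose scale is multiplied by sqrt(n).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Clip very large negative additive-mask values (for example the float
// minimum) to floor_value, so the mask spans a much narrower numeric range.
inline std::vector<float> clip_attention_mask(std::vector<float> mask,
                                              float floor_value = -1e4f) {
    for (float& m : mask) m = std::max(m, floor_value);
    return mask;
}

// Simplified layer norm with epsilon omitted (an assumption of this sketch):
//   y_i = gamma_i * x_i / sqrt(mean(x^2))
inline std::vector<float> simplified_layer_norm(const std::vector<float>& x,
                                                const std::vector<float>& gamma) {
    assert(!x.empty() && x.size() == gamma.size());
    double sum_sq = 0.0;
    for (float v : x) sum_sq += static_cast<double>(v) * v;
    assert(sum_sq > 0.0);
    const double rms = std::sqrt(sum_sq / static_cast<double>(x.size()));
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = static_cast<float>(static_cast<double>(gamma[i]) * x[i] / rms);
    }
    return y;
}

// Equivalent form: L2-normalize x, then scale by gamma_i * sqrt(n). The
// sqrt(n) factor can be folded into the weights once, ahead of time.
inline std::vector<float> l2_norm_form(const std::vector<float>& x,
                                       const std::vector<float>& gamma) {
    assert(!x.empty() && x.size() == gamma.size());
    double sum_sq = 0.0;
    for (float v : x) sum_sq += static_cast<double>(v) * v;
    assert(sum_sq > 0.0);
    const double l2 = std::sqrt(sum_sq);
    const double sqrt_n = std::sqrt(static_cast<double>(x.size()));
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double unit = x[i] / l2;  // element of the L2-normalized vector
        y[i] = static_cast<float>(unit * gamma[i] * sqrt_n);
    }
    return y;
}
```

Clipping does not change which positions are masked, because exp(-1e4) already underflows to zero in single precision, so masked positions still receive zero softmax weight.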
Foundry Local: expanding AI capabilities
Foundry Local provides high-performance built-in models and supports custom models optimized for silicon performance. Combined with Windows ML and Azure AI Foundry, developers can integrate AI into Windows apps and services with minimal friction, while taking advantage of Qualcomm Technologies’ hardware acceleration.
For general availability, Qualcomm Technologies is actively working with Microsoft on enabling popular models on Windows AI Foundry targeted for Windows ML on Snapdragon NPU with more models coming in the near future.
Foundry Local CLI
The Foundry Local Command Line Interface (CLI) provides simple-to-use commands for managing models hosted on the foundry.
View all available commands with the help option: foundry --help
The CLI organizes commands into three main categories:
- Model: Commands for managing and running AI models
- Service: Commands for controlling the Foundry Local service
- Cache: Commands for managing your local model storage
For a comprehensive guide, refer to Foundry Local.
Windows AI Foundry with Agentic AI on Snapdragon NPU
As part of Windows ML general availability readiness, Agentic AI applications using Windows AI Foundry have been validated on the Snapdragon NPU. Check out this tutorial from Microsoft for developing a translation app with AI Foundry and LangChain. This application has been validated using the deepseek-r1-distill-qwen-7b-qnn-npu model on Snapdragon NPU.
What’s next?
Beyond Windows ML general availability, Qualcomm Technologies is actively working on expanding the scope of Windows ML on Snapdragon NPU:
- Expanding LLM coverage on Windows AI Foundry
- Enabling ISV applications on Windows ML
- Improving the performance and accuracy of LLMs for current and future architectures
References
- What is Azure AI Foundry?
- What is Windows AI Foundry?
- Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs
- New WindowsML overview
- QNN Execution Provider
- ModelCloud/GPTQModel quantization
- quic/aimet: AIMET quantization
- inter_op_parallelism_threads and intra_op_parallelism_threads
- Understanding Inter-Op and Intra-Op Parallelism Threads
- Windows on Snapdragon




