Accelerate your AI apps: Windows ML on Snapdragon X Elite devices

Written by Manuj Sabharwal, Meghana Rao, Gokul Tonpe, and Vinesh Sukumar

Sep 24, 2025

Key feature enhancements of Windows ML for general availability

AI is transforming the PC experience, and workloads like Agentic AI and large language models (LLMs) are pushing the limits of compute and memory performance. Earlier this year at Microsoft Build, Qualcomm Technologies and Microsoft previewed Windows ML to bring advanced AI capabilities to the PC.

Today, we’re excited to take the next big step: announcing the general availability (GA) of Windows ML. This milestone reflects our deep collaboration with Microsoft to deliver cutting-edge AI experiences right on your device. With these advancements, developers and users get a faster, more efficient, and up-to-date AI stack on next-generation PCs.

“With Windows ML now in general availability, this is a pivotal moment for AI developers on Windows. The new Windows ML runtime not only delivers cutting-edge on-device inference but also simplifies deployment, enabling developers to fully harness advanced AI processors on Snapdragon X Series platforms.

Its unified framework and support for NPUs, GPUs, and CPUs enable exceptional performance and efficiency across Windows PCs with Snapdragon. As agentic AI experiences become mainstream, our deep collaboration with Microsoft is accelerating innovation and bringing the best AI experiences to Windows Copilot+ PCs and soon to our next-generation Snapdragon X2 Elite platforms.”
- Upendra Kulkarni, VP, Product Management, Qualcomm Technologies, Inc.

“Microsoft and Qualcomm are bringing advanced AI to Windows 11 PCs with Windows ML optimized for Snapdragon X Series platforms including the recently announced Snapdragon X2 platform. With Windows ML now generally available, developers can deliver faster, smarter AI experiences unlocking new possibilities for developers and the ecosystem.”

- Logan Iyer, VP, Distinguished Engineer, Windows Platform & Developer

Windows ML is a unified and high-performing local inference framework powered by ONNX Runtime Engine (ORT) with to-the-metal optimizations covering the breadth of the IHV ecosystem. Windows ML automatically binds apps to the best available silicon on client devices.

Key features of Windows ML for silicon management are:

· Handling the distribution of IHV dependencies such as compilers, runtimes and drivers across NPU, GPU and CPU.

· Dynamic binding of apps to available silicon: Apps will no longer need to explicitly enumerate and bind to NPU/GPU/CPU providers. However, flexibility exists for apps to explicitly specify a preferred execution target (e.g., execute on NPU only) if needed.

Alternatively, apps can leverage smart device selection policies such as high performance, high efficiency, or minimum power by using the LearningModelDeviceKind API.
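
The binding modes described above (automatic selection, an explicit target, and a policy-driven ranking) can be illustrated with a small, self-contained sketch. This is not the Windows ML implementation: the function name pick_device, the device labels, and the default NPU, GPU, CPU ranking are assumptions made purely for illustration.

```python
from typing import Optional, Sequence

# Assumed default ranking for illustration; Windows ML applies its own logic.
DEFAULT_ORDER = ("NPU", "GPU", "CPU")


def pick_device(
    available: Sequence[str],
    preferred: Optional[str] = None,
    order: Sequence[str] = DEFAULT_ORDER,
) -> str:
    """Return the device an app would bind to under this toy model."""
    if preferred is not None:
        # Explicit target: the app insists on one device, so fail loudly
        # instead of silently falling back to another one.
        if preferred not in available:
            raise RuntimeError(f"requested device {preferred!r} is not available")
        return preferred
    # Automatic or policy-driven binding: first available device in the ranking.
    for device in order:
        if device in available:
            return device
    raise RuntimeError("no supported device is available")
```

Passing a different order (for example, one that ranks the CPU first) stands in for a selection policy, while preferred models the "execute on NPU only" case.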

Taking advantage of Windows ML APIs on Snapdragon NPU for app acceleration

With just a few lines of code, Windows ML enables developers to quickly develop applications leveraging model inferencing on Snapdragon NPU and other Qualcomm silicon. This section highlights the core building blocks of Windows ML app development. Refer to Windows.AI.MachineLearning for detailed information on Windows ML APIs.

Windows ML APIs for infrastructure management

Initialize the Windows AI infrastructure and download the required IHV-specific execution provider (EP) binaries (the QNN EP for Qualcomm) onto the system.
Create the infrastructure object: winrt::Microsoft::Windows::AI::MachineLearning::Infrastructure infrastructure;
Download the EPs onto the machine: infrastructure.DownloadPackagesAsync().get();
Register the installed EPs with ORT: auto executionProviders = infrastructure.LoadExecutionProvidersAsync().get();

Windows ML APIs for session creation and inference

Dynamic system selection of the inference device: LearningModelDeviceKind deviceKind = LearningModelDeviceKind::Default;
Directed device selection: use another value of the LearningModelDeviceKind enum.

Set session options: LearningModelSessionOptions sessionOptions;
Create the session: LearningModelSession session = LearningModelSession{…};
Bind input and output resources: LearningModelBinding binding = LearningModelBinding{session};
Evaluate the model: LearningModelEvaluationResult results = session.EvaluateAsync(…).get();

ONNX Runtime APIs for session creation and inference

Set session options: Ort::SessionOptions session_options;
Allocate input and output tensors: session.GetInputCount(), session.GetOutputCount()
Compile the model: Ort::Status compileStatus = Ort::CompileModel(env, compile_options);
Run inference: session.Run()
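
For readers who prefer Python, a similar flow can be sketched with the standalone onnxruntime package. This is a sketch under assumptions rather than the C++ API listed above: it assumes an onnxruntime build that includes the QNN execution provider, the helper names qnn_provider_list and run_once are hypothetical, and the model compilation step is left out.

```python
from typing import Any, Dict, List


def qnn_provider_list(backend_path: str = "QnnHtp.dll") -> List[Any]:
    """Provider priority list: QNN EP (HTP/NPU backend) first, CPU as fallback."""
    return [
        ("QNNExecutionProvider", {"backend_path": backend_path}),
        "CPUExecutionProvider",
    ]


def run_once(model_path: str, feeds: Dict[str, Any]) -> List[Any]:
    """Create a session, check the feeds against the model inputs, and run."""
    # Imported lazily so qnn_provider_list() works without onnxruntime installed.
    import onnxruntime as ort

    session_options = ort.SessionOptions()
    session = ort.InferenceSession(
        model_path,
        sess_options=session_options,
        providers=qnn_provider_list(),
    )
    # Python counterparts of querying the model's inputs and outputs.
    input_names = [meta.name for meta in session.get_inputs()]
    output_names = [meta.name for meta in session.get_outputs()]
    missing = sorted(set(input_names) - set(feeds))
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    return session.run(output_names, feeds)
```

The feeds dictionary maps each model input name to a NumPy array of the shape and type the model expects. Listing the CPU provider second gives ONNX Runtime a fallback for operators the QNN EP does not take.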

Windows ML packaging and distribution

Windows ML strikes a balance between empowering IHVs to innovate and advance quickly with their implementations, while keeping serviceability and compatibility at the forefront. The Windows ML Runtime and IHV Execution Providers are packaged and distributed separately.

The Windows ML Infrastructure class provides methods to download, configure, and register AI execution providers (EPs) for use with the ONNX Runtime.
IHV execution providers are packaged as separate MSIX components (for example, the Qualcomm QNN EP). This design allows execution providers to be updated independently of the Windows ML Runtime components.

Windows ML general availability extensions to Olive for improved LLM Accuracy and Performance

Olive is a high-level, IHV-agnostic framework used to download models from Hugging Face, quantize and optimize them for specific hardware, and evaluate their accuracy and latency. The ONNX model generated by Olive is optimized for execution on the Qualcomm NPU using Windows ML with the QNN EP.

Olive provides both a CLI and a Python interface, with a JSON config file as input, to perform all the necessary graph transformations on the input ONNX model. For example: olive run --config resnet_ptq_qnn_qdq.json

At general availability, Windows ML support for Snapdragon NPU comes with accuracy and performance improvements implemented in Olive and ortGenAI for popular LLMs.

With enhancements to the prompt window processing, KV cache population, and token generation stages of the LLM pipeline in ONNX Runtime (ORT) v1.23, models like Phi-3.5 provide higher accuracy.

Quantization techniques play a key role in accuracy. GPTQModel quantization and AI Model Efficiency Toolkit (AIMET) quantization have been enabled in Microsoft’s Olive toolchain to improve the accuracy of LLMs on Snapdragon NPU.

GPTQModel: a production-ready LLM compression and quantization toolkit with hardware-accelerated inference support for both CPU and GPU via Hugging Face Transformers, vLLM, and SGLang. GPTQ aims to balance output quality and inference speed for real-world production deployment.
AIMET LPBQ quantization: AIMET is a software toolkit for quantizing trained ML models. It improves the runtime performance of deep learning models by reducing compute load and memory footprint, which makes them easier to deploy on edge devices such as mobile phones and laptops.

Performance improvement: By fine-tuning the inter-op and intra-op thread management options in Olive, the performance of several key LLMs has been brought to within a close margin of native performance.

Intra-op and inter-op thread optimization: These parameters control the number of threads used to parallelize work within a single operator (intra-op) and across independent operators (inter-op).
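
To make the tuning idea concrete, here is a minimal grid-search sketch over thread settings. It is not Olive's tuning pass: tune_threads, apply_thread_settings, and the latency callback are hypothetical names, and the only ONNX Runtime details relied on are the intra_op_num_threads and inter_op_num_threads attributes of its Python SessionOptions.

```python
import itertools
from typing import Any, Callable, Iterable, Tuple


def tune_threads(
    measure_latency_ms: Callable[[int, int], float],
    intra_candidates: Iterable[int],
    inter_candidates: Iterable[int],
) -> Tuple[int, int, float]:
    """Try every (intra, inter) pair and return the one with the lowest latency."""
    best = None
    for intra, inter in itertools.product(intra_candidates, inter_candidates):
        latency = measure_latency_ms(intra, inter)
        if best is None or latency < best[2]:
            best = (intra, inter, latency)
    if best is None:
        raise ValueError("no candidate thread settings were supplied")
    return best


def apply_thread_settings(session_options: Any, intra: int, inter: int) -> Any:
    """Copy the chosen values onto an onnxruntime.SessionOptions-like object."""
    session_options.intra_op_num_threads = intra
    # In ONNX Runtime, inter-op threads only take effect when the session's
    # execution mode is set to parallel rather than the default sequential mode.
    session_options.inter_op_num_threads = inter
    return session_options
```

In practice measure_latency_ms would build a session with the given settings and time a few inference runs; passing it in as a callback keeps the search logic independent of any particular model.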

The table below lists the Olive recipes for the non-GenAI and GenAI models that are included in the Windows ML Certification Requirements. Developers can use these recipes to deploy these models optimized for Snapdragon NPUs.

Model Name	Hugging Face Link	Qualcomm Olive Recipe (NPU)
Model Card for CLIP ViT-B/32 - LAION-2B	laion/CLIP-ViT-B-32-laion2B-s34B-b79K · Hugging Face	Olive/examples/vit at rel-0.9.2 · microsoft/Olive
OpenAI Model Card: CLIP 16x16	openai/clip-vit-base-patch16 · Hugging Face	Olive/examples/clip at rel-0.9.2 · microsoft/Olive
OpenAI Model Card: CLIP 32x32	openai/clip-vit-base-patch32 · Hugging Face	Olive/examples/clip at rel-0.9.2 · microsoft/Olive
Vision Transformer (base-sized model)	google/vit-base-patch16-224 · Hugging Face	Olive/examples/vit at rel-0.9.2 · microsoft/Olive
BERT multilingual base model (cased)	google-bert/bert-base-multilingual-cased · Hugging Face	Olive/examples/bert at rel-0.9.2 · microsoft/Olive
INT8 BERT base uncased finetuned MRPC	Intel/bert-base-uncased-mrpc-int8-static-inc · Hugging Face	Olive/examples/bert at rel-0.9.2 · microsoft/Olive
DeepSeek R1 Distill	https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	Olive/examples/deepseek at rel-0.9.2 · microsoft/Olive
Phi 3.5 mini instruct	https://huggingface.co/microsoft/Phi-3.5-mini-instruct	Olive/examples/phi3_5 at rel-0.9.2 · microsoft/Olive
Llama-3.2-1B-Instruct	https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct	Olive/examples/llama3 at rel-0.9.2 · microsoft/Olive
Qwen 2.5 1.5B	Qwen/Qwen2.5-1.5B-Instruct · Hugging Face	Olive/examples/qwen2_5 at rel-0.9.2 · microsoft/Olive

Olive high level developer workflow

1. Convert a Hugging Face model to ONNX

2. Convert dynamic shape inputs to fixed dimension

3. Quantize the model with OnnxStaticQuantizer (OnnxStaticQuantization).

4. Generate ONNX model with embedded EP context (EPContextBinaryGenerator)

5. Split a bigger model into smaller parts (CaptureSplitInfo/SplitModel)

6. Apply graph transformations available in Olive graph surgeries (GraphSurgeries)
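
The first four workflow steps can be sketched as an Olive-style JSON config built from Python. Treat this as an illustration, not one of the published recipes: the pass type names are Olive passes named in or corresponding to the steps above, but the pass keys, the dim_param names, the sequence length, and the output directory are assumptions. Real recipes also carry calibration data, target system settings, and further pass parameters that are omitted here.

```python
import json
from typing import Any, Dict


def build_olive_config(model_id: str, sequence_length: int = 128) -> Dict[str, Any]:
    """Assemble a minimal Olive-style workflow config as a plain dictionary."""
    return {
        "input_model": {"type": "HfModel", "model_path": model_id},
        # Passes run in insertion order; the keys are free-form labels.
        "passes": {
            "conversion": {"type": "OnnxConversion"},
            "fixed_shape": {
                "type": "DynamicToFixedShape",
                # Assumed dynamic axis names; they depend on the exported model.
                "dim_param": ["batch_size", "sequence_length"],
                "dim_value": [1, sequence_length],
            },
            # A real recipe also points this pass at calibration data.
            "quantization": {"type": "OnnxStaticQuantization"},
            "ep_context": {"type": "EPContextBinaryGenerator"},
        },
        "output_dir": "models/olive_qnn_sketch",  # hypothetical location
    }


def to_json(config: Dict[str, Any]) -> str:
    """Serialize the config so it can be saved and passed to the Olive CLI."""
    return json.dumps(config, indent=2)
```

Saving the output of to_json(build_olive_config("google-bert/bert-base-multilingual-cased")) to a file gives a config of the kind that olive run --config consumes; the recipes linked in the table remain the reference for working values.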

A combination of the following quantization techniques and graph surgeries was used to optimize performance on Snapdragon:

1. MatmulAddFusion: Fuses MatMul and Add ops into a Gemm operator, which is optimized for QNN.

2. ReplaceAttentionMask: Clips large negative attention mask values to -1e4 for quantization efficiency.

3. RemoveRopeMultiCache: Eliminates redundant RoPE computations across the KV cache.

4. AttentionMaskToSequenceLengths: Controls the context length at runtime.

5. SimplifiedLayerNormToL2Norm: Converts the simplified layer norm op to a more efficient L2 norm.

6. Weight Rotation: Reduces outliers in weights and hidden states to enhance quantization efficiency.

7. GPTQ: 4-bit per-channel symmetric quantization that reduces transformer layer size while preserving accuracy.

8. ONNX Graph Capture: Exports the model to ONNX for further optimizations.

9. 4-bit block-wise quantization of the embedding layer and LM head.

10. AIMET quantization, added in Olive to help achieve native scaling.
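
Three of the items above are easy to show numerically. The sketch below uses plain Python lists and hypothetical function names; it is not Olive's implementation. In particular, the quantizer is simple round-to-nearest with one symmetric scale per channel, which shows the 4-bit number format but not GPTQ's error-compensating weight updates, and the norm functions omit the epsilon term and assume a non-zero input.

```python
import math
from typing import List, Tuple


def clip_attention_mask(mask: List[float], floor: float = -1e4) -> List[float]:
    """Replace very large negative mask values with a quantization-friendly floor."""
    return [max(value, floor) for value in mask]


def quantize_per_channel_int4(
    weights: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Symmetric round-to-nearest 4-bit quantization, one scale per row (channel)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0  # maps max_abs to +/-7
        q_rows.append([max(-8, min(7, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate float weights from integers and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def rms_norm(x: List[float], gamma: List[float]) -> List[float]:
    """Simplified layer norm (no mean subtraction, no bias, epsilon omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [g * v / rms for v, g in zip(x, gamma)]


def l2_norm(x: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]
```

For a vector of length n, rms_norm(x, gamma) equals sqrt(n) * gamma * l2_norm(x) element by element, which is one way to see why a simplified layer norm can be rewritten as an L2 norm followed by a constant per-element scale.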

Foundry Local: expanding AI capabilities

Foundry Local provides high-performance built-in models and supports custom models optimized for silicon performance. Combined with Windows ML and Azure AI Foundry, developers can integrate AI into Windows apps and services with minimal friction, while taking advantage of Qualcomm Technologies’ hardware acceleration.

For general availability, Qualcomm Technologies is actively working with Microsoft on enabling popular models on Windows AI Foundry targeted for Windows ML on Snapdragon NPU with more models coming in the near future.

Foundry Local CLI

The Foundry Local command-line interface (CLI) provides simple-to-use commands for managing models hosted on Foundry Local.

View all available commands with the help option: foundry --help

The CLI organizes commands into three main categories:

Model: Commands for managing and running AI models
Service: Commands for controlling the Foundry Local service
Cache: Commands for managing your local model storage

For a comprehensive guide, refer to Foundry Local.
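
As a small illustration of the three command categories, the sketch below wraps one read-only command from each in Python. The dictionary and function names are hypothetical, and the script assumes the foundry executable is installed and on PATH; when it is not, it reports that instead of failing.

```python
import shutil
import subprocess
from typing import Dict, List

# One read-only example command per CLI category.
FOUNDRY_COMMANDS: Dict[str, List[str]] = {
    "model": ["foundry", "model", "list"],        # list available models
    "service": ["foundry", "service", "status"],  # check the local service
    "cache": ["foundry", "cache", "list"],        # list locally cached models
}


def run_foundry(category: str) -> str:
    """Run the example command for a category and return its text output."""
    command = FOUNDRY_COMMANDS[category]
    if shutil.which(command[0]) is None:
        return "foundry CLI not found on PATH"
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return completed.stdout


if __name__ == "__main__":
    for name in FOUNDRY_COMMANDS:
        print(f"[{name}]")
        print(run_foundry(name))
```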

Windows AI Foundry with Agentic AI on Snapdragon NPU

As part of the Windows ML general availability readiness, agentic AI applications using Windows AI Foundry have been validated on the Snapdragon NPU. Check out this tutorial from Microsoft for developing a translation app with AI Foundry and LangChain. This application has been validated using the deepseek-r1-distill-qwen-7b-qnn-npu model on Snapdragon NPU.

What’s next?

Beyond Windows ML general availability, Qualcomm Technologies is actively working on expanding the scope of Windows ML on Snapdragon NPU:

Expanding LLM coverage on Windows AI Foundry
Enabling ISV applications on Windows ML
Improving the performance and accuracy of LLMs for current and future architectures

Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. AIMET is a product of Qualcomm Innovation Center, Inc.

About the Authors

Manuj Sabharwal: Leads AI product strategy across Qualcomm’s software stack for Microsoft solutions.

Meghana Rao: Staff Product Manager at Qualcomm

Gokul Tonpe: Principal Engineer

Vinesh Sukumar: VP, Product Management of AI/GenAI, Qualcomm Technologies, Inc.