Accelerate your AI apps: Windows ML on Snapdragon X Elite devices

Key Feature Enhancements of Windows ML for General Availability

AI is transforming the PC experience, and workloads like Agentic AI and large language models (LLMs) are pushing the limits of compute and memory performance. Earlier this year at Microsoft Build, Qualcomm Technologies and Microsoft previewed Windows ML to bring advanced AI capabilities to the PC.

Today, we’re excited to take the next big step: announcing the general availability (GA) of Windows ML. This milestone reflects our deep collaboration with Microsoft to deliver cutting-edge AI experiences right on your device. With these advancements, developers and users can now take advantage of a faster, more efficient, and up-to-date AI stack on next-generation PCs.

“With Windows ML now in general availability, this is a pivotal moment for AI developers on Windows. The new Windows ML runtime not only delivers cutting-edge on-device inference but also simplifies deployment, enabling developers to fully harness advanced AI processors on Snapdragon X Series platforms.

Its unified framework and support for NPUs, GPUs, and CPUs enable exceptional performance and efficiency across Windows PCs with Snapdragon. As agentic AI experiences become mainstream, our deep collaboration with Microsoft is accelerating innovation and bringing the best AI experiences to Windows Copilot+ PCs and soon to our next-generation Snapdragon X2 Elite platforms.”
- Upendra Kulkarni, VP, Product Management, Qualcomm Technologies, Inc.

“Microsoft and Qualcomm are bringing advanced AI to Windows 11 PCs with Windows ML optimized for Snapdragon X Series platforms including the recently announced Snapdragon X2 platform. With Windows ML now generally available, developers can deliver faster, smarter AI experiences unlocking new possibilities for developers and the ecosystem.”

- Logan Iyer, VP, Distinguished Engineer, Windows Platform & Developer

Windows ML is a unified and high-performing local inference framework powered by ONNX Runtime Engine (ORT) with to-the-metal optimizations covering the breadth of the IHV ecosystem. Windows ML automatically binds apps to the best available silicon on client devices.

Key features of Windows ML for silicon management are:

  • Handling the distribution of IHV dependencies such as compilers, runtimes, and drivers across NPU, GPU, and CPU.
  • Dynamic binding of apps to available silicon: apps no longer need to explicitly enumerate and bind to NPU/GPU/CPU providers. Apps can still specify a preferred execution target explicitly (e.g., execute on NPU only) if needed.

Alternatively, apps can use device selection policies such as high performance, high efficiency, or minimum power through the LearningModelDeviceKind API.
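
To make the difference between dynamic binding and an explicit target concrete, here is a purely conceptual sketch in Python. It is not how Windows ML implements device selection: the function name, the NPU/GPU/CPU priority order, and the `strict` flag are all assumptions made for illustration.

```python
from typing import Optional, Sequence

# Hypothetical priority order for this sketch; the real selection logic is internal to Windows ML.
DEFAULT_PRIORITY = ("NPU", "GPU", "CPU")


def select_device(available: Sequence[str],
                  preferred: Optional[str] = None,
                  strict: bool = False) -> str:
    """Pick an execution target from the devices present on the machine."""
    if preferred is not None:
        if preferred in available:
            return preferred
        if strict:
            # "Execute on NPU only": fail instead of silently falling back.
            raise RuntimeError(f"{preferred} requested but not available")
    # Dynamic binding: take the first device in priority order that exists.
    for device in DEFAULT_PRIORITY:
        if device in available:
            return device
    raise RuntimeError("no supported device available")


print(select_device(["CPU", "GPU", "NPU"]))
print(select_device(["CPU", "GPU"], preferred="NPU"))
```

With `strict=True`, the second call would raise instead of falling back to the GPU, which is the behavior an app would want when it must run on the NPU only.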

Taking advantage of Windows ML APIs on Snapdragon NPU for app acceleration

With just a few lines of code, Windows ML enables developers to quickly develop applications leveraging model inferencing on Snapdragon NPU and other Qualcomm silicon. This section highlights the core building blocks of Windows ML app development. Refer to Windows.AI.MachineLearning for detailed information on Windows ML APIs.

Windows ML APIs for infrastructure management

  • Initialize the Windows AI infrastructure used to download the required IHV-specific execution provider (QNN EP for Qualcomm) binaries onto the system: winrt::Microsoft::Windows::AI::MachineLearning::Infrastructure infrastructure;
  • Download the EPs onto the machine: infrastructure.DownloadPackagesAsync().get();
  • Register the installed EPs with ORT: auto executionProviders = infrastructure.LoadExecutionProvidersAsync().get();

Windows ML APIs for session creation and inference

  • Dynamic system selection of Inference Device: LearningModelDeviceKind deviceKind = LearningModelDeviceKind::Default;
  • Directed device selection using LearningModelDeviceKind enum
  • Set Session Options: LearningModelSessionOptions sessionOptions;
  • Create Session: LearningModelSession session = LearningModelSession{…};
  • Bind input and output resources: LearningModelBinding binding = LearningModelBinding{session};
  • Evaluate model: LearningModelEvaluationResult results = session.EvaluateAsync(…).get();

ONNX Runtime APIs for session creation and inference

  • Set Session Options: Ort::SessionOptions session_options;
  • Allocate input and output tensors: session.GetInputCount(), session.GetOutputCount()
  • Compile Model: Ort::Status compileStatus = Ort::CompileModel(env, compile_options);
  • Infer Model: session.Run()

Windows ML packaging and distribution

Windows ML strikes a balance between empowering IHVs to innovate and advance quickly with their implementations and keeping serviceability and compatibility at the forefront. The Windows ML Runtime and IHV Execution Providers are packaged and distributed separately.

  • The Windows ML Infrastructure class provides methods to download, configure, and register AI execution providers (EPs) for use with the ONNX Runtime.
  • IHV Execution providers (EPs) are packaged as separate MSIX components, for example, Qualcomm QNN EP. This design allows execution providers to be updated independently from the Windows ML Runtime components.

Windows ML general availability extensions to Olive for improved LLM Accuracy and Performance

Olive is a high-level, IHV-agnostic framework used to download models from Hugging Face, quantize them, optimize them for specific hardware, evaluate them, check accuracy, and measure latency. The ONNX model generated by Olive is optimized for execution on the Qualcomm NPU using Windows ML with QNN EP.

Olive provides both a CLI and a Python interface, with a JSON file as the config input, to perform all the necessary graph transformations on the input ONNX model. Example: olive run --config resnet_ptq_qnn_qdq.json

At general availability, Windows ML support for Snapdragon NPU comes with accuracy and performance improvements implemented in Olive and ortGenAI for popular LLMs.

With enhancements to the prompt window processing, KV cache population, and token generation stages of the LLM pipeline in ONNX Runtime (ORT) v1.23, models like Phi-3.5 provide higher accuracy.

Quantization techniques play a key role in accuracy. GPTQModel quantization and AI Model Efficiency Toolkit (AIMET) quantization have been enabled in Microsoft’s Olive toolchain to improve the accuracy of LLMs on Snapdragon NPU.

  • GPTQModel is a production-ready LLM compression/quantization toolkit with hardware-accelerated inference support for both CPU and GPU via Hugging Face Transformers, vLLM, and SGLang. GPTQ offers a blend of quality and inference speed suited to real-world production deployment.

  • AIMET LPBQ quantization: AIMET is a software toolkit for quantizing trained ML models. It improves the runtime performance of deep learning models by reducing compute load and memory footprint, which makes quantized models easier to deploy on edge devices such as mobile phones or laptops.
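
As a rough illustration of the memory-footprint argument, the following back-of-the-envelope Python sketch estimates weight storage at different bit widths. The 1.5-billion parameter count is an assumption read off the "1.5B" in some of the model names in this post, and the estimate ignores quantization scales, zero points, activations, and the KV cache.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB (weights only, no scales or activations)."""
    return num_params * bits_per_weight / 8 / 2**30


params = 1.5e9  # assumed parameter count for a "1.5B" model
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_footprint_gib(params, bits):.2f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts weight storage to a quarter, which is the kind of reduction that makes on-device deployment practical.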

Performance improvement: By fine-tuning the inter-op and intra-op thread management options in Olive, the performance of several key LLMs has been brought to within a close margin of native performance.

  • Intra-op and inter-op thread optimization: These parameters control the number of threads used to parallelize work within a single operation (intra-op) and across different operations (inter-op).
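
The distinction between the two thread settings can be sketched with Python's standard thread pools. This only illustrates the structure (one pool splitting a single operation, another running independent operations side by side); it is not ONNX Runtime's scheduler, and the function names are hypothetical. In ONNX Runtime itself the corresponding session options are `intra_op_num_threads` and `inter_op_num_threads`.

```python
from concurrent.futures import ThreadPoolExecutor


def dot_intra_op(a, b, intra_op_threads):
    """One operation (a dot product) split into chunks across intra-op threads."""
    chunk = max(1, (len(a) + intra_op_threads - 1) // intra_op_threads)

    def partial(start):
        return sum(x * y for x, y in zip(a[start:start + chunk], b[start:start + chunk]))

    with ThreadPoolExecutor(max_workers=intra_op_threads) as pool:
        return sum(pool.map(partial, range(0, len(a), chunk)))


def run_independent_ops(ops, inter_op_threads):
    """Independent operations (no data dependency between them) run side by side."""
    with ThreadPoolExecutor(max_workers=inter_op_threads) as pool:
        futures = [pool.submit(op) for op in ops]
        return [f.result() for f in futures]


a = list(range(8))
b = [2] * 8
ops = [
    lambda: dot_intra_op(a, b, intra_op_threads=4),
    lambda: dot_intra_op(b, b, intra_op_threads=2),
]
print(run_independent_ops(ops, inter_op_threads=2))
```

Too many threads at either level can oversubscribe the cores, which is why these values are tuned per model and device rather than simply maximized.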

The table below lists the Olive recipes for the non-GenAI and GenAI models that are included in the Windows ML Certification Requirements. Developers can use these recipes to deploy the models optimized for Snapdragon NPUs.

Model Name | Hugging Face Link | Qualcomm Olive Recipe (NPU)
Model Card for CLIP ViT-B/32 - LAION-2B | laion/CLIP-ViT-B-32-laion2B-s34B-b79K | Olive/examples/vit at rel-0.9.2 · microsoft/Olive
OpenAI Model Card: CLIP 16x16 | openai/clip-vit-base-patch16 | Olive/examples/clip at rel-0.9.2 · microsoft/Olive
OpenAI Model Card: CLIP 32x32 | openai/clip-vit-base-patch32 | Olive/examples/clip at rel-0.9.2 · microsoft/Olive
Vision Transformer (base-sized model) | google/vit-base-patch16-224 | Olive/examples/vit at rel-0.9.2 · microsoft/Olive
BERT multilingual base model (cased) | google-bert/bert-base-multilingual-cased | Olive/examples/bert at rel-0.9.2 · microsoft/Olive
INT8 BERT base uncased finetuned MRPC | Intel/bert-base-uncased-mrpc-int8-static-inc | Olive/examples/bert at rel-0.9.2 · microsoft/Olive
DeepSeek R1 Distill | https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | Olive/examples/deepseek at rel-0.9.2 · microsoft/Olive
Phi 3.5 mini instruct | https://huggingface.co/microsoft/Phi-3.5-mini-instruct | Olive/examples/phi3_5 at rel-0.9.2 · microsoft/Olive
Llama-3.2-1B-Instruct | https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct | Olive/examples/llama3 at rel-0.9.2 · microsoft/Olive
Qwen 2.5 1.5B | Qwen/Qwen2.5-1.5B-Instruct | Olive/examples/qwen2_5 at rel-0.9.2 · microsoft/Olive

Olive high level developer workflow

1. Convert a Hugging Face model to ONNX

2. Convert dynamic shape inputs to fixed dimensions

3. Quantize the model with OnnxStaticQuantizer (OnnxStaticQuantization)

4. Generate an ONNX model with embedded EP context (EPContextBinaryGenerator)

5. Split a bigger model into smaller parts (CaptureSplitInfo/SplitModel)

6. Apply graph transformations available in Olive graph surgeries (GraphSurgeries)
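
The workflow above maps onto the "passes" section of an Olive JSON config. The Python sketch below assembles such a config skeleton so the ordering is easy to see. Only the pass types named in parentheses above come from this post; `OnnxConversion`, `DynamicToFixedShape`, `HfModel`, every key name, and the model placeholder are assumptions, and real recipes carry many more options and may order the passes differently. Check the recipe for your model in the Olive repository before relying on any of these names.

```python
import json

# (key, pass type) pairs in the order of the steps listed above.
# Entries marked "assumed" use names that do not appear in this post.
steps = [
    ("conversion", "OnnxConversion"),            # step 1 (assumed pass name)
    ("fixed_shape", "DynamicToFixedShape"),      # step 2 (assumed pass name)
    ("quantization", "OnnxStaticQuantization"),  # step 3
    ("ep_context", "EPContextBinaryGenerator"),  # step 4
    ("capture_split", "CaptureSplitInfo"),       # step 5
    ("split", "SplitModel"),                     # step 5
    ("surgeries", "GraphSurgeries"),             # step 6
]

config = {
    # Placeholder model reference; "HfModel" and "model_path" are assumed field values.
    "input_model": {"type": "HfModel", "model_path": "<hugging-face-model-id>"},
    "passes": {key: {"type": pass_type} for key, pass_type in steps},
}

print(json.dumps(config, indent=2))
```

The printed JSON is only a skeleton; a file like this would be passed to the `olive run` command with its config option once the per-pass options are filled in.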

A combination of the following quantization techniques and graph surgeries was used to optimize performance on Snapdragon:

1. MatmulAddFusion: Fuses MatMul and Add ops into a Gemm operator, optimized for QNN.

2. ReplaceAttentionMask: Clips large negative attention mask values to -1e4 for quantization efficiency.

3. RemoveRopeMultiCache: Eliminates redundant RoPE computations across the KV cache.

4. AttentionMaskToSequenceLengths: Controls the context length at runtime.

5. SimplifiedLayerNormToL2Norm: Converts the basic layer norm op to a more efficient L2 norm.

6. Weight Rotation: Reduces outliers in weights and hidden states to enhance quantization efficiency.

7. GPTQ: 4-bit per-channel symmetric quantization; reduces transformer layer size while preserving accuracy.

8. ONNX Graph Capture: Exports the model to ONNX for further optimizations.

9. 4-bit block-wise quantization of the embedding layer and LM head.

10. AIMET quantization has been added in Olive to help achieve native scaling.
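
Two of the items above are simple enough to sketch in a few lines of plain Python. The first function clamps very large negative mask values to -1e4, as described for ReplaceAttentionMask; the second performs 4-bit per-channel symmetric quantization using plain round-to-nearest. This is not the Olive implementation, and it is not GPTQ either: GPTQ additionally compensates for rounding error, which is omitted here. The function names and the [-7, 7] integer range are assumptions chosen for the illustration.

```python
def clip_attention_mask(mask, floor=-1e4):
    """Clamp very large negative mask values (e.g. float32 min) to `floor`."""
    return [max(v, floor) for v in mask]


def quantize_per_channel_int4(weights):
    """Symmetric 4-bit quantization with one scale per output channel (row).

    Round-to-nearest only; GPTQ's error compensation is not shown.
    """
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 7 if max_abs else 1.0  # symmetric range used here: [-7, 7]
        quantized.append([max(-7, min(7, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Map integer codes back to approximate floating-point weights."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


mask = [0.0, -3.4028234663852886e38, -1e9]
print(clip_attention_mask(mask))

w = [[0.7, -0.3, 0.1], [-14.0, 6.0, 2.0]]
q, s = quantize_per_channel_int4(w)
print(q)
print(dequantize(q, s))
```

Clipping matters for quantization because a single value near the float32 minimum would otherwise dominate the tensor's range and leave almost no resolution for the values that carry information. Per-channel scales address the same problem for weights: each row gets its own scale, so a row with small weights is not flattened by a row with large ones.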

Foundry Local: expanding AI capabilities

Foundry Local provides high-performance built-in models and supports custom models optimized for silicon performance. Combined with Windows ML and Azure AI Foundry, developers can integrate AI into Windows apps and services with minimal friction, while taking advantage of Qualcomm Technologies’ hardware acceleration.

For general availability, Qualcomm Technologies is actively working with Microsoft to enable popular models on Windows AI Foundry, targeted for Windows ML on Snapdragon NPU, with more models coming in the near future.

Foundry Local CLI

The Foundry Local Command Line Interface (CLI) provides simple-to-use commands for managing models hosted on the foundry.

View all available commands with the help option: foundry --help

The CLI organizes commands into three main categories:

  • Model: Commands for managing and running AI models
  • Service: Commands for controlling the Foundry Local service
  • Cache: Commands for managing your local model storage

For a comprehensive guide, refer to Foundry Local.

Windows AI Foundry with Agentic AI on Snapdragon NPU

As part of the Windows ML general availability readiness, agentic AI applications using Windows AI Foundry have been validated on the Snapdragon NPU. Check out this tutorial from Microsoft for developing a translation app with AI Foundry and LangChain. This application has been validated using the deepseek-r1-distill-qwen-7b-qnn-npu model on Snapdragon NPU.

What’s next?

Beyond Windows ML general availability, Qualcomm Technologies is actively working on expanding the scope of Windows ML on Snapdragon NPU:

  • Expanding LLM coverage on Windows AI Foundry
  • Enabling ISV applications on Windows ML
  • Improving performance and accuracy of LLMs for current and future architectures

Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. AIMET is a product of Qualcomm Innovation Center, Inc.

About the Authors
Manuj Sabharwal: Leads AI product strategy across Qualcomm’s software stack for Microsoft solutions.
Meghana Rao: Staff Product Manager at Qualcomm
Gokul Tonpe: Principal Engineer
Vinesh Sukumar: VP, Product Management of AI/GenAI, Qualcomm Technologies, Inc.