Windows AI Foundry & Windows ML on Qualcomm NPU

Windows AI Foundry lets you integrate AI into your apps using Windows AI APIs powered by Windows built-in models. These models are optimized for ultra-low latency and high performance on Windows Copilot+ PCs. You can customize the inbox models with your own local data by using Windows AI APIs such as LoRA (low-rank adaptation), semantic search, and knowledge retrieval for LLMs. Foundry Local provides quick access to a rich set of ready-to-use open-source (OSS) models.

Figure 1. Windows AI Foundry diagram

Announced at Microsoft Build, Windows ML is a unified, high-performance local inference framework powered by the ONNX Runtime (ORT) engine, with to-the-metal optimizations that cover the breadth of the independent hardware vendor (IHV) ecosystem. Windows ML automatically binds an app to the best available silicon on the client device.

The key features of Windows ML listed below provide a good abstraction over ever-changing silicon:

·      Handling the distribution of IHV dependencies such as compilers, runtimes and drivers for NPU.

·      Automatically binding models to the right silicon for execution.

·      Removing the need for apps to explicitly enumerate and bind to NPU/GPU/CPU providers. Apps still retain the ability to explicitly specify a preferred execution target (e.g., execute on NPU only). Alternatively, apps can use the smart device selection policies, such as high performance, high efficiency, or minimum power.

Starting in early 2025, Qualcomm has collaborated closely with Microsoft to develop and optimize Windows AI Foundry models on the Qualcomm NPU using Windows ML and the Qualcomm Neural Network Execution Provider (QNNEP).

The diagram below illustrates the building blocks of a Windows ML-based AI application. There are two aspects to this application:

·      Asynchronous download of the execution provider (EP) using the Windows ML infrastructure API.

·      AI inferencing using Windows ML APIs.

Figure 2. Windows AI Foundry application

One prerequisite for developing a Windows AI Foundry application is to download and install the Windows AI Foundry MSIX from the Microsoft Store.

Windows AI Foundry API to download required EP packages

The API sequence below downloads the required IHV-specific EP binaries (here, the Qualcomm NNEP) onto the system.

  • winrt::Microsoft::Windows::AI::MachineLearning::Infrastructure infrastructure;
  • infrastructure.DownloadPackagesAsync().get();
  • auto executionProviders = infrastructure.LoadExecutionProvidersAsync().get();

Windows ML APIs for session creation and inference

  • Select Inference Device: LearningModelDeviceKind deviceKind = LearningModelDeviceKind::Default;

  • Set Session Options: LearningModelSessionOptions sessionOptions;

  • Create Session: LearningModelSession session = LearningModelSession{…};

  • Bind inputs and output resources: LearningModelBinding binding = LearningModelBinding{session};

  • Evaluate model: LearningModelEvaluationResult results = session.EvaluateAsync(…).get();

ONNX Runtime APIs for session creation and inference

  • Set Session Options: Ort::SessionOptions session_options;

  • Allocate input and output Tensors: session.GetInputCount(), session.GetOutputCount()

  • Compile Model: Ort::Status compileStatus = Ort::CompileModel(env, compile_options);

  • Infer Model: session.Run()
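
For readers who use the ONNX Runtime Python bindings instead of C++, the provider choice behind the session-creation steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's code: the helper build_providers is hypothetical, and it assumes the Qualcomm EP is registered as QNNExecutionProvider with a backend_path option that points at the HTP (NPU) backend library QnnHtp.dll. The helper only builds the providers argument, so it runs without ONNX Runtime installed.

```python
from typing import Dict, List, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, str]]]

QNN_EP = "QNNExecutionProvider"
CPU_EP = "CPUExecutionProvider"


def build_providers(available: List[str], npu_only: bool = False) -> List[Provider]:
    """Return a `providers` list that prefers the Qualcomm NPU.

    available: provider names, e.g. from onnxruntime.get_available_providers().
    npu_only:  if True, do not add the CPU provider as a fallback.
    """
    providers: List[Provider] = []
    if QNN_EP in available:
        # Assumption: backend_path selects the QNN backend; QnnHtp.dll targets the NPU.
        providers.append((QNN_EP, {"backend_path": "QnnHtp.dll"}))
    elif npu_only:
        raise RuntimeError("QNN execution provider is not available")
    if not npu_only:
        providers.append(CPU_EP)
    return providers


# Hypothetical usage with ONNX Runtime installed (paths and names are placeholders):
#   import onnxruntime as ort
#   providers = build_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
#   outputs = session.run(None, {"input": input_array})
```

Keeping the CPU provider last in the list gives a fallback for operators the NPU backend does not support, while npu_only=True mirrors the "execute on NPU only" preference mentioned earlier.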

GenAI and LLM optimizations for Snapdragon using Microsoft Olive

Olive is a high-level, IHV-agnostic framework used to download models from Hugging Face, quantize them, run inference, and check accuracy and latency. The ONNX model generated by Olive can be executed on the Qualcomm NPU using ORT or WinML with the Qualcomm NNEP. Olive provides a CLI and a Python interface that take a JSON config file as input and perform all the necessary graph transformations on the input ONNX model. For example: olive run --config resnet_ptq_qnn_qdq.json

Since the advent of Windows AI Foundry in Q1 2025, the AI engineering teams at Qualcomm Technologies have partnered closely with Microsoft to optimize several key GenAI and LLM models on the Snapdragon NPU using the Microsoft Olive toolchain. This section highlights the key technical aspects of these optimizations to help developers optimize their own models for Snapdragon.

Figure 3. Olive Developer Workflow for Snapdragon

Olive Recipes for Qualcomm NPU

Model Name | Hugging Face Link | Qualcomm Olive Recipe
-----------|-------------------|----------------------
Model Card for CLIP ViT-B/32 - LAION-2B | laion/CLIP-ViT-B-32-laion2B-s34B-b79K · Hugging Face | Olive/examples/vit/qnn at main · microsoft/Olive
OpenAI Model Card: CLIP 32x32 | openai/clip-vit-base-patch16 · Hugging Face | Olive/examples/clip at main · microsoft/Olive
OpenAI Model Card: CLIP 32x32 | openai/clip-vit-base-patch32 · Hugging Face | Olive/examples/clip at main · microsoft/Olive
Vision Transformer (base-sized model) | google/vit-base-patch16-224 · Hugging Face | Olive/examples/vit/qnn at main · microsoft/Olive
BERT multilingual base model (cased) | google-bert/bert-base-multilingual-cased · Hugging Face | Olive/examples/bert/qnn at main · microsoft/Olive
INT8 BERT base uncased finetuned MRPC | Intel/bert-base-uncased-mrpc-int8-static-inc · Hugging Face | Olive/examples/bert/qnn at main · microsoft/Olive
DeepSeek R1 Distill | https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | Olive/examples/deepseek at main · microsoft/Olive
Phi 3.5 mini instruct | https://huggingface.co/microsoft/Phi-3.5-mini-instruct | Olive/examples/phi3_5 at main · microsoft/Olive
Llama-3.2-1B-Instruct | https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct | Olive/examples/llama3/README.md at main · microsoft/Olive

Olive High-Level Developer Workflow

  1. Hugging Face to ONNX conversion.
  2. Convert dynamic-shape inputs to fixed dimensions.
  3. OnnxStaticQuantization: Quantize the model with OnnxStaticQuantizer.
  4. EPContextBinaryGenerator: Generate an ONNX model with embedded EP context.
  5. CaptureSplitInfo/SplitModel: Split a larger model into smaller parts.
  6. GraphSurgeries: Apply the graph transformations available in Olive graph surgeries.
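
The workflow above maps onto an Olive JSON config in which each step is one entry under "passes". The sketch below generates a skeleton of such a config. It is illustrative only: the top-level layout (input_model, passes, output_dir) is assumed to match the published recipes, OnnxConversion and DynamicToFixedShape are my assumed pass types for steps 1 and 2, the pass labels and output path are made up, and all per-pass parameters (calibration data, fixed shapes, QNN options) are left out. For a working configuration, start from the recipes listed in the table.

```python
import json
import os
import tempfile

# Pass types for the first four workflow steps, in execution order.
# The labels ("conversion", ...) are arbitrary names chosen for this sketch.
PASSES = [
    ("conversion", "OnnxConversion"),             # assumed pass type for step 1
    ("fixed_shape", "DynamicToFixedShape"),       # assumed pass type for step 2
    ("quantization", "OnnxStaticQuantization"),   # step 3
    ("ep_context", "EPContextBinaryGenerator"),   # step 4
]


def build_config(model_id: str, output_dir: str) -> dict:
    """Build a skeleton Olive config; per-pass parameters are intentionally omitted."""
    return {
        "input_model": {"type": "HfModel", "model_path": model_id},
        "passes": {label: {"type": pass_type} for label, pass_type in PASSES},
        "output_dir": output_dir,
    }


def write_config(config: dict, path: str) -> None:
    """Serialize the config so it can be handed to the Olive CLI."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)


config = build_config("google/vit-base-patch16-224", "models/vit_qnn")
config_path = os.path.join(tempfile.gettempdir(), "vit_qnn_sketch.json")
write_config(config, config_path)

# The file would then be passed to the CLI: olive run --config <config_path>
print(json.dumps(config, indent=2))
```

Generating the config from a short list keeps the pass order explicit, which matters because each pass consumes the model produced by the previous one.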

A combination of the following quantization techniques and graph surgeries was used to optimize performance on Snapdragon:

  1. MatmulAddFusion: Fuses the MatMul and Add ops into a single Gemm operator, which is optimized for QNN.
  2. ReplaceAttentionMask: Clips large negative attention mask values to -1e4 for quantization efficiency.
  3. RemoveRopeMultiCache: Eliminates redundant RoPE computations across the KV cache.
  4. AttentionMaskToSequenceLengths: Controls the context length at runtime.
  5. SimplifiedLayerNormToL2Norm: Converts the simplified layer norm op to a more efficient L2 norm.
  6. Weight Rotation: Reduces outliers in weights and hidden states to enhance quantization efficiency.
  7. GPTQ: 4-bit per-channel symmetric quantization that reduces transformer layer size while preserving accuracy.
  8. ONNX Graph Capture: Exports the model to ONNX for further optimizations.
  9. 4-bit block-wise quantization: Applied to the embedding layer and the LM head.
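
Two of the techniques above are simple enough to illustrate in a few lines of plain Python. The sketch below is not Olive's implementation: it shows the idea behind clipping attention mask values to -1e4, and a round-to-nearest version of symmetric 4-bit block-wise quantization (this is not GPTQ, which additionally compensates for quantization error). The function names, block size, and sample values are assumptions made for illustration.

```python
from typing import List, Tuple

MASK_FLOOR = -1e4  # clip value named in the ReplaceAttentionMask item


def clip_attention_mask(mask: List[float], floor: float = MASK_FLOOR) -> List[float]:
    """Replace very large negative mask values (e.g. float32 min) with `floor`."""
    return [max(v, floor) for v in mask]


def quantize_block(block: List[float]) -> Tuple[List[int], float]:
    """Symmetric 4-bit quantization of one block: integers in [-8, 7] plus one scale."""
    max_abs = max(abs(v) for v in block)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in block]
    return q, scale


def quantize_blockwise(weights: List[float], block_size: int = 4) -> List[Tuple[List[int], float]]:
    """Quantize a flat weight list in independent blocks, each with its own scale."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def dequantize_blockwise(blocks: List[Tuple[List[int], float]]) -> List[float]:
    """Reconstruct approximate float weights from (integers, scale) pairs."""
    return [q * scale for qs, scale in blocks for q in qs]


mask = [0.0, 0.0, -3.4028234663852886e38]
print(clip_attention_mask(mask))  # the large negative entry becomes -10000.0

weights = [0.02, -0.01, 0.03, 0.07, 5.0, -2.0, 1.0, 0.0]
blocks = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(blocks)
print(blocks)
print(restored)
```

The sample weights show why block-wise scales help: with a single scale shared by all eight values, the four small weights would all round to zero, whereas per-block scales keep them distinguishable. Clipping the mask serves a similar purpose for activations, since a value near the float32 minimum would otherwise dominate the quantization range.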





Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

About the Author
Gokul Tonpe, Principal Engineer