Windows AI Foundry & Windows ML on Qualcomm NPU

Written by

Gokul Tonpe

May 19, 2025

Windows AI Foundry lets you integrate AI using Windows AI APIs powered by Windows built-in models. These models are optimized for ultra-low latency and high performance on Windows Copilot+ PCs. You can customize the inbox models with your own local data by using Windows AI APIs such as LoRA (low-rank adaptation), semantic search, and knowledge retrieval for LLMs. Foundry Local provides quick access to a rich set of ready-to-use open-source (OSS) models.

Windows AI Foundry diagram

Announced at Microsoft Build, Windows ML is a unified, high-performance local inference framework powered by the ONNX Runtime (ORT) engine, with to-the-metal optimizations covering the breadth of the independent hardware vendor (IHV) ecosystem. Windows ML automatically binds an app to the best available silicon on client devices.

The key features of Windows ML listed below provide an abstraction over ever-changing silicon:

· Handling the distribution of IHV dependencies such as compilers, runtimes, and drivers for the NPU.

· Automatically binding models to the right silicon for execution.

· Removing the need for apps to explicitly enumerate and bind to NPU/GPU/CPU providers. Apps still retain the ability to explicitly specify a preferred execution target (e.g., execute on NPU only). Alternatively, apps can use Windows ML's device selection policies, such as high performance, high efficiency, or minimum power.
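To make the difference between an explicit target and a selection policy concrete, here is a minimal conceptual sketch in Python. It is not the Windows ML API or its actual selection algorithm: the `Policy` enum, the `Device` record, the `select_device` function, and all throughput and power numbers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Policy(Enum):
    """Hypothetical stand-ins for the policies named in the post."""
    HIGH_PERFORMANCE = "high_performance"
    HIGH_EFFICIENCY = "high_efficiency"
    MIN_POWER = "min_power"


@dataclass(frozen=True)
class Device:
    kind: str          # "NPU", "GPU" or "CPU"
    throughput: float  # made-up inferences per second
    watts: float       # made-up power draw


def select_device(devices: Sequence[Device], policy: Policy,
                  preferred_kind: Optional[str] = None) -> Device:
    """Honor an explicit preference first; otherwise rank devices by policy."""
    if preferred_kind is not None:
        for device in devices:
            if device.kind == preferred_kind:
                return device
        raise LookupError(f"no {preferred_kind} device available")
    if policy is Policy.HIGH_PERFORMANCE:
        return max(devices, key=lambda d: d.throughput)
    if policy is Policy.HIGH_EFFICIENCY:
        return max(devices, key=lambda d: d.throughput / d.watts)
    return min(devices, key=lambda d: d.watts)  # Policy.MIN_POWER


# Illustrative numbers only; they do not describe any real hardware.
DEVICES = [
    Device("CPU", throughput=10.0, watts=2.0),
    Device("GPU", throughput=90.0, watts=30.0),
    Device("NPU", throughput=60.0, watts=5.0),
]

if __name__ == "__main__":
    for policy in Policy:
        print(policy.value, "->", select_device(DEVICES, policy).kind)
```

The point of the sketch is the shape of the decision: an explicit preference short-circuits the policy, and each policy is just a different ranking over the same device list.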

Since early 2025, Qualcomm has collaborated closely with Microsoft to develop and optimize Windows AI Foundry models on the Qualcomm NPU using Windows ML and the Qualcomm Neural Network Execution Provider (QNNEP).

The diagram below illustrates the building blocks of a Windows ML-based AI application. There are two aspects to this application:

· Asynchronous download of execution providers (EPs) using the Windows ML infrastructure API.

· AI inferencing using Windows ML APIs.

Figure 2. Windows AI Foundry application

One prerequisite for developing a Windows AI Foundry application is to download and install the Windows AI Foundry MSIX from the Microsoft Store.

Windows AI Foundry API to download required EP packages

The API sequence below downloads the required IHV-specific EP binaries (here, the Qualcomm NNEP) onto the system.

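// Entry point for managing execution provider (EP) packages.
winrt::Microsoft::Windows::AI::MachineLearning::Infrastructure infrastructure;
// Download the EP packages; get() blocks until the asynchronous operation completes.
infrastructure.DownloadPackagesAsync().get();
// Load the downloaded EPs so they are available for inference.
auto executionProviders = infrastructure.LoadExecutionProvidersAsync().get();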

Windows ML APIs for session creation and inference

Select Inference Device: LearningModelDeviceKind deviceKind = LearningModelDeviceKind::Default;
Set Session Options: LearningModelSessionOptions sessionOptions;
Create Session: LearningModelSession session = LearningModelSession{…};
Bind input and output resources: LearningModelBinding binding = LearningModelBinding{session};
Evaluate model: LearningModelEvaluationResult results = session.EvaluateAsync(…).get();

ONNX Runtime APIs for session creation and inference

Set Session Options: Ort::SessionOptions session_options;
Allocate input and output tensors: session.GetInputCount(); session.GetOutputCount();
Compile Model: Ort::Status compileStatus = Ort::CompileModel(env, compile_options);
Infer Model: session.Run()

GenAI and LLM optimizations for Snapdragon using Microsoft Olive

Olive is a high-level, IHV-agnostic framework used to download models from Hugging Face, quantize them, run inference, and check accuracy and latency. The ONNX model generated by Olive can be executed on the Qualcomm NPU using ORT or WinML with the Qualcomm NNEP. Olive provides a CLI and a Python interface that take a JSON config file as input and perform all the necessary graph transformations on the input ONNX model. For example: olive run --config resnet_ptq_qnn_qdq.json

With the advent of Windows AI Foundry in Q1 2025, the AI engineering teams at Qualcomm Technologies have partnered closely with Microsoft to optimize several key GenAI models and LLMs on the Snapdragon NPU using the Microsoft Olive toolchain. This section highlights the key technical aspects of these optimizations to help developers optimize their own models for Snapdragon.

Figure 3. Olive Developer Workflow for Snapdragon

Olive Recipes for Qualcomm NPU

Model Name	Hugging Face Link	Qualcomm Olive Recipe
CLIP ViT-B/32 - LAION-2B	laion/CLIP-ViT-B-32-laion2B-s34B-b79K	microsoft/Olive (main): examples/vit/qnn
OpenAI CLIP ViT-B/16	openai/clip-vit-base-patch16	microsoft/Olive (main): examples/clip
OpenAI CLIP ViT-B/32	openai/clip-vit-base-patch32	microsoft/Olive (main): examples/clip
Vision Transformer (base-sized model)	google/vit-base-patch16-224	microsoft/Olive (main): examples/vit/qnn
BERT multilingual base model (cased)	google-bert/bert-base-multilingual-cased	microsoft/Olive (main): examples/bert/qnn
INT8 BERT base uncased finetuned MRPC	Intel/bert-base-uncased-mrpc-int8-static-inc	microsoft/Olive (main): examples/bert/qnn
DeepSeek R1 Distill	https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	microsoft/Olive (main): examples/deepseek
Phi 3.5 mini instruct	https://huggingface.co/microsoft/Phi-3.5-mini-instruct	microsoft/Olive (main): examples/phi3_5
Llama-3.2-1B-Instruct	https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct	microsoft/Olive (main): examples/llama3/README.md

Olive High-Level Developer Workflow

Hugging Face to ONNX conversion: Export the Hugging Face model to ONNX.
Dynamic to fixed shape: Convert dynamic-shape inputs to fixed dimensions.
OnnxStaticQuantization: Quantize the model with OnnxStaticQuantizer.
EPContextBinaryGenerator: Generate an ONNX model with embedded EP context.
CaptureSplitInfo/SplitModel: Split a larger model into smaller parts.
GraphSurgeries: Apply the graph transformations available in Olive's graph surgeries.
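The workflow above maps naturally onto the passes section of an Olive JSON config. The sketch below builds such a config in Python and serializes it. Treat it as an outline rather than a working recipe: the pass type names OnnxStaticQuantization, EPContextBinaryGenerator, and GraphSurgeries come from the list above, while `OnnxConversion`, `DynamicToFixedShape`, the `HfModel` input layout, the pass keys, the pass ordering, and the helper name `build_olive_config` are assumptions that should be checked against the recipe files in the Olive repository. Per-pass parameters are deliberately omitted.

```python
import json


def build_olive_config(model_id: str) -> dict:
    """Assemble a skeleton Olive-style config (assumed layout, no pass parameters)."""
    passes = {
        # Keys are arbitrary labels; "type" names the pass to run.
        "conversion": {"type": "OnnxConversion"},          # assumed name
        "fixed_shape": {"type": "DynamicToFixedShape"},    # assumed name
        "surgery": {"type": "GraphSurgeries"},
        "quantization": {"type": "OnnxStaticQuantization"},
        "ep_context": {"type": "EPContextBinaryGenerator"},
    }
    return {
        "input_model": {"type": "HfModel", "model_path": model_id},  # assumed layout
        "passes": passes,
    }


# Model ID taken from the recipe table; any Hugging Face ID could be used here.
config = build_olive_config("google/vit-base-patch16-224")
config_json = json.dumps(config, indent=2)

if __name__ == "__main__":
    print(config_json)
```

Once saved to a file, a config of this shape is what the olive run --config command takes as input. Python dictionaries preserve insertion order, so the serialized JSON keeps the passes in the order they were declared.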

A combination of the following quantization techniques and graph surgeries was used to optimize performance on Snapdragon:

MatmulAddFusion: Fuses MatMul and Add ops into a single Gemm operator, which is optimized for QNN.
ReplaceAttentionMask: Clips large negative attention mask values to -1e4 for quantization efficiency.
RemoveRopeMultiCache: Eliminates redundant RoPE computations across the KV cache.
AttentionMaskToSequenceLengths: Replaces the attention mask with sequence lengths so that the context length can be controlled at runtime.
SimplifiedLayerNormToL2Norm: Converts the simplified layer norm op to a more efficient L2 norm.
Weight Rotation: Reduces outliers in weights and hidden states to enhance quantization efficiency.
GPTQ: 4-bit per-channel symmetric quantization; reduces transformer layer size while preserving accuracy.
ONNX Graph Capture: Exports the model to ONNX for further optimizations.
4-bit block-wise quantization: Applied to the embedding layer and LM head.
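Three of the techniques above are easy to illustrate numerically. The following sketch is plain Python written for this post, not Olive's implementation; every function name is hypothetical. It shows (1) clipping attention mask values to -1e4, (2) why a simplified (RMS) layer norm can be expressed as an L2 normalization scaled by the square root of the hidden size (epsilon is omitted here so the identity is exact), and (3) symmetric 4-bit quantization with one scale per block. The block size and the choice of 7 as the positive limit are assumptions.

```python
import math
from typing import List, Tuple


def clip_attention_mask(mask: List[float], floor: float = -1e4) -> List[float]:
    """Replace very large negative mask values with a quantization-friendly floor."""
    return [max(value, floor) for value in mask]


def rms_norm(x: List[float], gain: List[float]) -> List[float]:
    """Simplified layer norm: divide by the root mean square, then apply the gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [g * v / rms for v, g in zip(x, gain)]


def rms_norm_via_l2(x: List[float], gain: List[float]) -> List[float]:
    """Same result using an L2 normalization and a constant sqrt(n) factor."""
    l2 = math.sqrt(sum(v * v for v in x))
    scale = math.sqrt(len(x))
    return [g * scale * v / l2 for v, g in zip(x, gain)]


def quantize_blockwise_int4(weights: List[float],
                            block_size: int) -> List[Tuple[float, List[int]]]:
    """Symmetric 4-bit quantization with a separate scale for each block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        peak = max(abs(w) for w in block)
        scale = peak / 7.0 if peak > 0 else 1.0  # map the block peak to code 7
        codes = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks: List[Tuple[float, List[int]]]) -> List[float]:
    """Reconstruct approximate weights from per-block scales and integer codes."""
    return [scale * code for scale, codes in blocks for code in codes]


if __name__ == "__main__":
    print(clip_attention_mask([0.0, float("-inf"), -5.0]))
    print(rms_norm([1.0, -2.0, 3.0, 4.0], [1.0] * 4))
    print(dequantize_blockwise(quantize_blockwise_int4([0.1, -0.7, 10.0, -5.0], 2)))
```

Because each block carries its own scale, a single outlier only coarsens the quantization grid of its own block rather than that of the whole tensor, and the rounding error of every weight stays within half of its block's scale.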

About the Author

Gokul Tonpe, Principal Engineer