Windows AI Foundry lets you integrate AI using Windows AI APIs powered by models built into Windows. These models are optimized for ultra-low latency and high performance on Windows Copilot+ PCs. You can customize inbox models with your own local data by using Windows AI APIs such as LoRA (low-rank adaptation), semantic search, and knowledge retrieval for LLMs. Foundry Local provides quick access to a rich set of ready-to-use OSS models.
Announced at Microsoft Build, Windows ML is a unified, high-performance local inference framework powered by the ONNX Runtime (ORT) engine, with to-the-metal optimizations covering the breadth of the IHV ecosystem. Windows ML automatically binds an app to the best available silicon on client devices.
The key features of Windows ML listed below provide a good abstraction over ever-changing silicon:
· Handling the distribution of IHV dependencies such as compilers, runtimes and drivers for NPU.
· Automatically binding models to the right silicon for execution.
· Apps no longer need to explicitly enumerate and bind to NPU/GPU/CPU providers. Apps still retain the ability to explicitly specify a preferred execution target (e.g., execute on NPU only). Alternatively, apps can use the smart device selection policies, such as high performance, high efficiency, or minimum power.
Starting in early 2025, Qualcomm collaborated closely with Microsoft to develop and optimize Windows AI Foundry models on the Qualcomm NPU using Windows ML and the Qualcomm Neural Network Execution Provider (QNNEP).
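To make the idea of a selection policy concrete, here is a minimal, self-contained sketch of how a policy could map to a device choice. Everything in it is hypothetical: the `Device` struct, the `Policy` enum, the `throughput` and `watts` fields, and the scoring rules are illustrative assumptions, not Windows ML types or its actual selection logic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only; these are NOT Windows ML APIs.
enum class DeviceKind { CPU, GPU, NPU };
enum class Policy { HighPerformance, HighEfficiency, MinimumPower };

struct Device {
    DeviceKind kind;
    double throughput;  // assumed relative inferences per second
    double watts;       // assumed average power draw while inferencing
};

// Higher score is better under the given policy.
double Score(const Device& d, Policy policy) {
    switch (policy) {
        case Policy::HighPerformance: return d.throughput;
        case Policy::HighEfficiency:  return d.throughput / d.watts;
        case Policy::MinimumPower:    return -d.watts;
    }
    return 0.0;
}

// Returns the index of the best device, or -1 if none qualifies.
// If only_kind is non-null, only devices of that kind are considered
// (the "execute on NPU only" style of explicit targeting).
int SelectDevice(const std::vector<Device>& devices, Policy policy,
                 const DeviceKind* only_kind = nullptr) {
    int best = -1;
    double best_score = 0.0;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        const Device& d = devices[i];
        if (only_kind && d.kind != *only_kind) continue;
        const double s = Score(d, policy);
        if (best < 0 || s > best_score) {
            best = static_cast<int>(i);
            best_score = s;
        }
    }
    return best;
}
```

The point of the sketch is the split between an explicit target (the optional `only_kind` filter) and a policy-driven choice (the score), which mirrors the two options the list above describes.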
The diagram below illustrates the building blocks of a Windows ML-based AI application. There are two aspects to this application:
· Asynchronous Execution Provider downloading using Windows ML infrastructure API
· AI inferencing using Windows ML APIs.
One prerequisite for developing a Windows AI Foundry application is to download and install the Windows AI Foundry MSIX from the Microsoft Store.
Windows AI Foundry API to download required EP packages
The API sequence below downloads the required IHV-specific EP binaries (for example, Qualcomm NNEP) onto the system.
- winrt::Microsoft::Windows::AI::MachineLearning::Infrastructure infrastructure;
- infrastructure.DownloadPackagesAsync().get();
- auto executionProviders = infrastructure.LoadExecutionProvidersAsync().get();
Windows ML APIs for session creation and inference
- Select Inference Device: LearningModelDeviceKind deviceKind = LearningModelDeviceKind::Default;
- Set Session Options: LearningModelSessionOptions sessionOptions;
- Create Session: LearningModelSession session = LearningModelSession{…};
- Bind inputs and output resources: LearningModelBinding binding = LearningModelBinding{session};
- Evaluate model: LearningModelEvaluationResult results = session.EvaluateAsync(…).get();
ONNX Runtime APIs for session creation and inference
- Set Session Options: Ort::SessionOptions session_options;
- Allocate input and output Tensors: session.GetInputCount() session.GetOutputCount()
- Compile Model: Ort::Status compileStatus = Ort::CompileModel(env, compile_options);
- Infer Model: session.Run()
GenAI and LLM optimizations for Snapdragon using Microsoft Olive
Olive is a high-level, IHV-agnostic framework used to download models from Hugging Face, quantize them, run inference, and check accuracy and latency. The ONNX model generated by Olive can be executed on the Qualcomm NPU using ORT or Windows ML with Qualcomm NNEP. Olive provides a CLI and a Python interface that take a JSON config file as input and perform all the necessary graph transformations on the input ONNX model, e.g. olive run --config resnet_ptq_qnn_qdq.json.
With the advent of Windows AI Foundry in Q1 2025, the AI engineering teams at Qualcomm Technologies have partnered closely with Microsoft to optimize several key GenAI and LLM models on the Snapdragon NPU using the Microsoft Olive toolchain. This section highlights the key technical aspects of these optimizations to help developers optimize their own models for Snapdragon.
Olive Recipes for Qualcomm NPU
| Model Name | Hugging Face Link | Qualcomm Olive Recipe |
| --- | --- | --- |
| Model Card for CLIP ViT-B/32 - LAION-2B | | |
| OpenAI Model Card: CLIP 32x32 | | |
| OpenAI Model Card: CLIP 32x32 | | |
| Vision Transformer (base-sized model) | | |
| BERT multilingual base model (cased) | | |
| INT8 BERT base uncased finetuned MRPC | | |
| DeepSeek R1 Distill | https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | |
| Phi 3.5 mini instruct | | |
| Llama-3.2-1B-Instruct | | |
Olive High-Level Developer Workflow
- Hugging Face to ONNX conversion
- Convert dynamic shape inputs to fixed dimension
- OnnxStaticQuantization: Quantize the model with OnnxStaticQuantizer.
- EPContextBinaryGenerator: Generate ONNX model with embedded EP context.
- CaptureSplitInfo/SplitModel: To split a bigger model into smaller parts
- GraphSurgeries: graph transformation available in Olive graph surgeries.
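The static quantization step in the workflow above can be pictured with a small worked example. This is a minimal sketch of generic asymmetric uint8 quantization with parameters derived from calibration data; it is not Olive's or ONNX Runtime's implementation, and the names `QuantParams`, `Calibrate`, `Quantize`, and `Dequantize` are hypothetical helpers introduced for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper types; not part of Olive or ONNX Runtime.
struct QuantParams {
    float scale;
    std::uint8_t zero_point;
};

// "Static" means scale and zero point are fixed ahead of time from
// calibration samples instead of being computed at inference time.
QuantParams Calibrate(const std::vector<float>& calibration) {
    // The range always includes 0 so that 0.0f maps to an exact code.
    float lo = 0.0f, hi = 0.0f;
    for (float v : calibration) {
        lo = std::min(lo, v);
        hi = std::max(hi, v);
    }
    float scale = (hi - lo) / 255.0f;
    if (scale == 0.0f) scale = 1.0f;  // degenerate all-zero calibration set
    long zp = std::lround(-lo / scale);
    zp = std::min(255L, std::max(0L, zp));
    return QuantParams{scale, static_cast<std::uint8_t>(zp)};
}

// Map a float to a uint8 code, saturating outside the calibrated range.
std::uint8_t Quantize(float x, const QuantParams& p) {
    long q = std::lround(x / p.scale) + p.zero_point;
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, q)));
}

// Map a uint8 code back to an approximate float value.
float Dequantize(std::uint8_t q, const QuantParams& p) {
    return (static_cast<int>(q) - static_cast<int>(p.zero_point)) * p.scale;
}
```

With calibration samples spanning -1.0 to 3.0, the scale is 4/255 and the zero point is 64, so values outside the calibrated range saturate at 0 or 255. This is why representative calibration data matters for accuracy after quantization.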
A combination of the following quantization techniques and graph surgeries was used to optimize performance on Snapdragon:
- MatmulAddFusion: Fuses the MatMul and Add ops into a single Gemm operator, which is optimized for QNN.
- ReplaceAttentionMask: Clips large negative attention mask values to -1e4 for quantization efficiency
- RemoveRopeMultiCache: Eliminates redundant RoPE computations across KV cache
- AttentionMaskToSequenceLengths: To control the context length at runtime
- SimplifiedLayerNormToL2Norm: Converts the basic layer norm op to a more efficient L2 norm op
- Weight Rotation: Reduces outliers from weights and hidden states to enhance quantization efficiency.
- GPTQ: 4-bit per-channel symmetric quantization, reduces transformer layer size while preserving accuracy.
- ONNX Graph Capture: Exports the model to ONNX for further optimizations.
- 4-bit block-wise quantization applied to the embedding layer and LM head.
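Two of the items above lend themselves to a short illustration: clipping the attention mask and 4-bit block-wise quantization. The sketch below is a simplified, self-contained approximation under stated assumptions. It uses plain round-to-nearest symmetric quantization (not GPTQ), stores each 4-bit code unpacked in an `int8_t` for readability, and assumes `block_size` is greater than zero. The names `ClipAttentionMask`, `BlockQuantized`, `QuantizeBlockwise4Bit`, and `DequantizeAt` are hypothetical and do not correspond to Olive passes or APIs.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Replace very large negative mask values with a floor of -1e4 so the
// mask tensor has a range that a quantizer can represent sensibly.
void ClipAttentionMask(std::vector<float>& mask, float floor_value = -1e4f) {
    for (float& v : mask) v = std::max(v, floor_value);
}

// Hypothetical container: one signed 4-bit code per weight (unpacked),
// plus one scale per block of block_size consecutive weights.
struct BlockQuantized {
    std::vector<std::int8_t> values;  // codes in [-8, 7]
    std::vector<float> scales;
    std::size_t block_size;
};

// Symmetric round-to-nearest quantization with a separate scale per block.
// Assumes block_size > 0.
BlockQuantized QuantizeBlockwise4Bit(const std::vector<float>& w,
                                     std::size_t block_size) {
    BlockQuantized out{{}, {}, block_size};
    out.values.reserve(w.size());
    for (std::size_t start = 0; start < w.size(); start += block_size) {
        const std::size_t end = std::min(w.size(), start + block_size);
        float max_abs = 0.0f;
        for (std::size_t i = start; i < end; ++i)
            max_abs = std::max(max_abs, std::fabs(w[i]));
        const float scale = max_abs > 0.0f ? max_abs / 7.0f : 1.0f;
        out.scales.push_back(scale);
        for (std::size_t i = start; i < end; ++i) {
            const long q = std::lround(w[i] / scale);
            out.values.push_back(
                static_cast<std::int8_t>(std::min(7L, std::max(-8L, q))));
        }
    }
    return out;
}

// Reconstruct the approximate float weight at index i.
float DequantizeAt(const BlockQuantized& q, std::size_t i) {
    return q.values[i] * q.scales[i / q.block_size];
}
```

Because each block carries its own scale, a block of small weights keeps its resolution even when another block in the same tensor contains much larger values. That is the main motivation for block-wise schemes over a single per-tensor scale.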

