Back to All
Developer Blog

How to run DeepSeek models on Windows on Snapdragon – Llama.cpp and MLC-LLM tutorial

Sign up for Developer monthly newsletter-image

Sign up for Developer monthly newsletter

Join thousands of developers around the globe who receive latest news and updates from our monthly curated newsletter.

Sign up
Come for support, stay for the community-image

Come for support, stay for the community

Get support from experts, connect with like-minded developers, and access exclusive virtual events.

Join Developer Discord

Co-written with Li He, Srinivasa Deevi, Hongqiang Wang, Siva Rama Krishna Reddy B.

DeepSeek-R1 is an open-source reasoning model developed by DeepSeek to handle tasks requiring logical inference, mathematical problem-solving and real-time decision making. One of its standout features is the ability to trace its logic, which makes it easier to understand and, if necessary, challenge its output.

This transparency is particularly valuable in fields where explainable outcomes are crucial, such as research and complex decision making.

AI distillation is a process that creates smaller, more efficient models from larger ones, retaining much of their reasoning power while reducing computational demands. DeepSeek has applied this technique to develop a suite of distilled models from R1, using Qwen and Llama architectures. That allows users to take advantage of the capabilities of DeepSeek-R1 on standard laptops.

Developers can pick a few options to run the models on Windows on Snapdragon CPU.

The first one through Llama.cpp - an open-source project developed to implement Meta’s LLaMa architecture efficiently in C/C++.

The second option is through popular LLM platforms like LMStudio. Such LLM platforms can help to quickly try out different DeepSeek model variants without any additional changes on Snapdragon X Series laptops.

Similarly, to run the models on Windows on Snapdragon GPU, we offer 2 options: either through Llama.cpp, with relevant Adreno library packages and command parameters necessary or through MLC-LLM (an advanced machine learning compiler and high-performance deployment engine tailored for large language models).

This tutorial shows you how to run DeepSeek-R1 models on Windows on Snapdragon CPU and GPU using Llama.cpp and MLC-LLM. You can run the steps below on Snapdragon X Series laptops.

Running on CPU – Llama.cpp how to guide

You can use Llama.cpp to run DeepSeek on the CPU of devices powered by the Snapdragon X Series. See below for instructions on using llama.cpp to run on GPU.

  1. Navigate to https://github.com/ggerganov/llama.cpp/releases/latest.
  2. Under Assets, look for llama-bxxxx-bin-win-llvm-arm64.zip, where xxxx are numerals; for example, llama-b4601-bin-win-llvm-arm64.zip. Download the zip file and extract the content.
  3. Download a quantized version of the DeepSeek distilled models. The examples below use Q4_0.
  4. Run the model with llama-cli, using the following command:
.\llama-cli.exe -m ..\ggml-model-deepseek-r1-distill-qwen-1.5b-Q4_0_pure.gguf -no-cnv -b 128 -ngl 0 -c 2048 -p "Hello"
  1. Alternatively, benchmark all three models – 1.5b, 7b and 8b – using llama-bench:
.\llama-bench.exe -b 128 -ngl 0 -p 256 -n 100 -m ..\ggml-model-deepseek-r1-distill-qwen-1.5b-Q4_0_pure.gguf -m ..\ggml-model-deepseek-r1-distill-qwen-7b-Q4_0.gguf -m ..\ggml-model-deepseek-r1-distill-llama-8b-Q4_0.gguf

Running on GPU – Llama.cpp how to guide

See above for instructions on using Llama.cpp to run on CPU.

  1. Navigate to https://github.com/ggerganov/llama.cpp/releases/latest.
  2. Under Assets, look for llama-bxxxx-bin-win-llvm-arm64-opencl-adreno.zip, where xxxx are numerals; for example, llama-b4601-bin-win-llvm-arm64-opencl-adreno.zip. Download the zip file and extract the content.
  3. Prepare DeepSeek distilled models in pure Q4_0 format. (Note on model preparation: Q4_0 models downloaded from Hugging Face usually contain Q6_K quantization, which will run slower on Adreno GPU. To achieve best performance, models should be quantized using pure Q4_0. To get pure Q4_0 models, add --pure to llama-quantize.)
  4. Run the model with llama-cli, using the following command:
.\llama-cli.exe -m ..\ggml-model-deepseek-r1-distill-qwen-1.5b-Q4_0_pure.gguf -no-cnv -b 128 -ngl 99 -c 2048 -p "Hello"
  1. Alternatively, benchmark all three models – 1.5b, 7b and 8b – using llama-bench:
.\llama-bench.exe -b 128 -ngl 99 -p 256 -n 100 -m ..\ggml-model-deepseek-r1-distill-qwen-1.5b-Q4_0_pure.gguf -m ..\ggml-model-deepseek-r1-distill-qwen-7b-Q4_0.gguf -m ..\ggml-model-deepseek-r1-distill-llama-8b-Q4_0.gguf

Running on GPU – MLC-LLM how to guide

  1. Set up the host

    Install Anaconda from https://docs.anaconda.com/anaconda/install/windows/
conda create -n mlc-venv -c conda-forge "llvmdev=15" "cmake>=3.24" git rust numpy==1.26.4 decorator psutil typing_extensions scipy attrs git-lfs python=3.12 onnx clang_win-64
conda activate mlc-venv

For additional dependencies, run the following command:

pip install torch==2.2.0 torchvision==0.18.0 torchaudio==2.3.0

Compiling the TVM module requires a GCC compiler later than version 7.1. You may install it from MinGW Distro, available at nuwen.net. Add the path to the compiler explicitly to your system PATH.

Run on the Windows target.

  1. Download packages.

    MLC-LLM for Adreno is simplified by building a python wheel with an Adreno-specific compile configuration, pre-compiled target binaries and tools. All the required installers are available in the releases in JFrog.io. Download the following packages for Windows environments:
mlc_llm-utils-win-x86-01_31_2025.zip
mlc_llm_adreno_cpu_01_31_2025-0.1.dev0-cp312-cp312-win_amd64.whl
tvm_adreno_cpu_01_31_2025-0.19.dev0-cp312-cp312-win_amd64.whl
  1. Install the packages.

    Under the Anaconda environment you configured above, set up as shown below:
pip install tvm_adreno_cpu_01_31_2025-0.19.dev0-cp312-cp312-win_amd64.whl
pip install mlc_llm_adreno_cpu_01_31_2025-0.1.dev0-cp312-cp312-win_amd64.whl

Check the installation status as below:

python -c "import tvm; print(tvm.__path__)"\
python -c "import mlc_llm; print(mlc_llm.__path__)"

Download the utils and extract as below:

mlc_llm-utils-win-x86-01_31_2025
  bin
  ├── mlc_cli_chat.exe
  ├── mlc_llm.dll
  ├── mlc_llm_module.dll
  └── tvm_runtime.dll
  1. Compile the DeepSeek model. 

    Given a DeepSeek-R1-Distill-Qwen-1.5B located in the folder dist/models/Meta-Llama-3-8B-Instruct, the compilation process follows the stages described below:
    • Generate configuration
# Generate Config
python -m  mlc_llm gen_config \
     ./dist/models/DeepSeek-R1-Distill-Qwen-1.5B \
     --quantization q4f16_0 \
     --conv-template  deepseek_r1_qwen \
     --prefill-chunk-size 256  \
     -o ./dist/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_0-MLC
  • Quantize parameters
# Quantize Parameters
python3 -m mlc_llm convert_weight \
        ./dist/models/DeepSeek-R1-Distill-Qwen-1.5B \
        --quantization q4f16_0 \
        -o ./dist/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_0-MLC
  • Compile model
python3 -m mlc_llm compile \
        ./dist/DeepSeek-R1-Distill-Qwen-1.5Bq4f16_0-MLC/mlc-chat-config.json \
        --device  windows:adreno_x86 \
        -o ./dist/libs/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_0-adreno.dll

The artifacts you need to pick here are the quantized weights located at ./dist/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_0-MLC and the model library located at ./dist/libs/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_0-adreno.dll.

  1. Run on the Adreno GPU target.

    Windows supports both a native CLI approach and the Python way of running the compiled model.
  • Native CLI

Use the precompiled target CLI utils in mlc_llm-utils-win-x86-01_31_2025.tar.bz2. There is the CLI tool mlc_cli_chat.exe and its dependencies, as listed below.

mlc_llm-utils-win-x86-01_31_2025
  bin
  ├── mlc_cli_chat.exe
  ├── mlc_llm.dll
  ├── mlc_llm_module.dll
  └── tvm_runtime.dll

Now, copy build artifacts dist/DeepSeek-R1-Distill-Qwen-1.5B and dist/libs/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_0-adreno.dll to target device.

Launch a terminal on the target machine and run the following command. It will start an interactive chat:

mlc_cli_chat.exe --model <PATH to DeepSeek-R1-Distill-Qwen-1.5B-q4f16_0-MLC> --model-lib <PATH to DeepSeek-R1-Distill-Qwen-1.5B-q4f16_0-adreno.dll> --device opencl
  • Python CLI

Under Anaconda shell with the environment mlc-venv, execute the following command:

python -m mlc_llm chat --device opencl --model-lib <PATH to DeepSeek-R1-Distill-Qwen-1.5B-q4f16_0-adreno.dll> <PATH to DeepSeek-R1-Distill-Qwen-1.5B-q4f16_0-MLC>

Next steps

Now it’s your turn. Our engineers have put together those tutorials so you can experiment with DeepSeek-R1 models on CPU and GPU. Refer to the tutorial to learn how to run DeepSeek-R1 models on Windows on Snapdragon GPU using Ollama and how to run DeepSeek R1 models on LM Studio tutorial..

Meanwhile, Microsoft is bringing NPU-optimized versions of DeepSeek-R1 directly to Copilot+ PCs, starting with Qualcomm Snapdragon X Series devices.

The company also announced that the distilled DeepSeek R1 models, optimized using ONNX, are now available on Snapdragon-powered Copilot+ PCs. These models offer a time to first token of less than 70 ms for short prompts (<64 tokens) and a throughput rate of 25-40 tokens/s, with longer responses achieving higher throughput. Get started today by downloading the AI Toolkit extension in VS Code.

 

Want to find out more about DeepSeek on Windows on Snapdragon? Join our Developer Discord for more insights and real-time conversations with fellow developers and our internal experts.

Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Qualcomm-branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

About the Authors
Devang Aggarwal
Devang AggarwalProduct Manager, Senior
Dileep Karpur
Dileep Karpur
Qualcomm relentlessly innovates to deliver intelligent computing everywhere, helping the world tackle some of its most important challenges. Our leading-edge AI, high performance, low-power computing, and unrivaled connectivity deliver proven solutions that transform major industries. At Qualcomm, we are engineering human progress.

Stay connected

Get the latest Qualcomm and industry information delivered to your inbox.

Subscribe
Manage your subscription

© Qualcomm Technologies, Inc. and/or its affiliated companies.

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. Qualcomm patented technologies are licensed by Qualcomm Incorporated.

Note: Certain services and materials may require you to accept additional terms and conditions before accessing or using those items.

References to "Qualcomm" may mean Qualcomm Incorporated, or subsidiaries or business units within the Qualcomm corporate structure, as applicable.

Qualcomm Incorporated includes our licensing business, QTL, and the vast majority of our patent portfolio. Qualcomm Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of our engineering, research and development functions, and substantially all of our products and services businesses, including our QCT semiconductor business.

Materials that are as of a specific date, including but not limited to press releases, presentations, blog posts and webcasts, may have been superseded by subsequent events or disclosures.

Nothing in these materials is an offer to sell or license any of the services or materials referenced herein.