
Big Performance Boost for llama.cpp and chatglm.cpp with Windows on Snapdragon

Written by

Changgeng Zhong

Apr 12, 2024


If you’ve been developing AI apps for Windows on Snapdragon, you may have seen sub-par performance from llama.cpp and chatglm.cpp. Using the build commands published on the projects’ open-source pages, engineers at Qualcomm Technologies and partner OEMs have measured as few as 3 tokens per second on Windows on Snapdragon devices.

You can now accelerate that processing by enabling two features, NEON and FMA_ARM, and building with either LLVM-MinGW or MSVC using the commands in this post. Engineers at Qualcomm Technologies have measured up to 20 tokens per second on a device powered by the Snapdragon X Elite Compute Platform.

In this post, you’ll see how to build llama.cpp and chatglm.cpp with the LLVM-MinGW and MSVC commands to improve performance.
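As background on what the NEON and FMA features provide, here is a minimal sketch of a dot product, the kind of inner loop that dominates LLM inference. It is an illustration only, not code taken from llama.cpp or ggml; the function name dot_f32 is hypothetical. When the compiler defines the ARM feature macros, the loop processes four floats per iteration with a fused multiply-add intrinsic; otherwise it falls back to plain scalar code.

```cpp
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not part of llama.cpp or ggml): dot product of two
// float arrays of length n.
float dot_f32(const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    float sum = 0.0f;
#if defined(__ARM_NEON) && defined(__ARM_FEATURE_FMA) && defined(__aarch64__)
    // NEON path: accumulate four lanes at a time with fused multiply-add.
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    // Horizontal add of the four lanes (AArch64-only intrinsic).
    sum = vaddvq_f32(acc);
#endif
    // Scalar path: handles the tail, or the whole array without NEON.
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

If the feature macros are missing at compile time, only the scalar loop is built, which is one plausible way a build can end up much slower than the hardware allows.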

Download the code for llama.cpp and chatglm.cpp

llama.cpp

llama.cpp is designed to run Meta's GPT-3-class large language model (LLM) – known as LLaMA – on local devices, including those powered by Windows on Snapdragon. It enables high performance on a wide variety of hardware with minimal setup.

chatglm.cpp

ChatGLM is an open, bilingual language model optimized for Chinese conversation, based on the General Language Model architecture. chatglm.cpp enables real-time inference on laptops such as those powered by Windows on Snapdragon. It is accelerated by quantization, in a way that is similar to llama.cpp.

Once you have the code for both projects, the procedure depends on whether you build with LLVM-MinGW or Microsoft Visual C++ (MSVC).

Build llama.cpp and chatglm.cpp (using LLVM-MinGW)

MinGW is a native Windows port of the open-source GCC toolchain. LLVM-MinGW is a related toolchain that pairs the LLVM/Clang compiler with the mingw-w64 header files and import libraries, so you can use it to build native Windows applications. Here is how to use LLVM-MinGW to enable the NEON and FMA_ARM features:

1. Download and install the tools.

Download the LLVM-MinGW compiler toolchain.
Unzip the compressed file to a folder, then add the toolchain’s bin folder to the PATH system environment variable so that the compiler can be found from the command prompt.
Download the latest CMake Windows ARM64 installer and install it.

2. No patches are needed for llama.cpp or chatglm.cpp.

3. From the Windows command prompt, change to the llama.cpp or chatglm.cpp source directory and run the following commands to build the binaries:

mkdir build
cd build
cmake .. -G "MinGW Makefiles"
cmake --build . --config Release

Build llama.cpp and chatglm.cpp (using MSVC)

You can also use MSVC to build high-performance binary files and enable the NEON and FMA_ARM features.

1. Download and install the tools.

Download the latest CMake Windows ARM64 installer from https://cmake.org/download/ and install it.
Download Visual Studio 2022 and install it.

2. For llama.cpp, there is no need to apply a patch, but for chatglm.cpp, you must apply the following patch:

C:\chatglm.cpp\third_party\ggml>git diff
diff --git a/include/ggml/ggml.h b/include/ggml/ggml.h
index 4b16032..eaa6c79 100644
--- a/include/ggml/ggml.h
+++ b/include/ggml/ggml.h
@@ -286,7 +286,7 @@ extern "C" {

 #if defined(__ARM_NEON) && defined(__CUDACC__)
     typedef half ggml_fp16_t;
-#elif defined(__ARM_NEON)
+#elif defined(__ARM_NEON) && !defined(_MSC_VER)
     typedef __fp16 ggml_fp16_t;
 #else
     typedef uint16_t ggml_fp16_t;
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index b225597..d1a1d66 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -41,8 +41,14 @@ endif()

 if (${CMAKE_SYSTEM_NAME} STREQUAL "Emscripten")
     message(STATUS "Emscripten detected")
-elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "arm" OR ${CMAKE_SYSTEM_PROCESSOR} MATCHES "aarch64")
+elseif ((${CMAKE_SYSTEM_PROCESSOR} MATCHES "arm") OR (${CMAKE_SYSTEM_PROCESSOR} MATCHES "aarch64") OR (${CMAKE_SYSTEM_PROCESSOR} MATCHES "ARM64"))
     message(STATUS "ARM detected")
+       if (MSVC)
+           add_compile_definitions(__ARM_NEON)
+           add_compile_definitions(__ARM_FEATURE_FMA)
+           add_compile_definitions(__ARM_FEATURE_DOTPROD)
+           add_compile_definitions(__aarch64__) # MSVC defines _M_ARM64 instead
+       endif()
     #set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mcpu=apple-m1")
 elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "ppc64le" OR ${CMAKE_SYSTEM_PROCESSOR} MATCHES "ppc64")
     message(STATUS "PPC64 detected")

3. From the Windows command prompt, change to the llama.cpp or chatglm.cpp source directory and run the following commands to build the binaries:

mkdir build
cd build
cmake .. -A ARM64
cmake --build . --config Release
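The first hunk of the chatglm.cpp patch above deserves a short explanation. The __fp16 type is a GCC/Clang extension that MSVC does not provide, so under MSVC the patch makes ggml_fp16_t fall through to uint16_t, which holds the raw 16 bits of a half-precision value. Code that stores half-precision values this way has to convert them to float in software. The sketch below shows one way such a conversion can work. It is an illustration under that assumption, not ggml's actual conversion routine, and the name fp16_bits_to_float is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not ggml's implementation): converts the raw bits of
// an IEEE 754 half-precision value into a 32-bit float.
float fp16_bits_to_float(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exp = (h >> 10) & 0x1Fu;   // 5-bit exponent
    std::uint32_t mant = h & 0x3FFu;               // 10-bit mantissa
    std::uint32_t bits;

    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                           // signed zero
        } else {
            // Subnormal half: shift until the implicit leading bit appears.
            std::uint32_t shift = 0;
            while ((mant & 0x400u) == 0) {
                mant <<= 1;
                ++shift;
            }
            mant &= 0x3FFu;
            bits = sign | ((113u - shift) << 23) | (mant << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);  // infinity or NaN
    } else {
        // Normal number: rebias the exponent from 15 to 127.
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }

    float result;
    std::memcpy(&result, &bits, sizeof result);    // bit cast without UB
    return result;
}
```

For example, the half-precision bit pattern 0x3C00 converts to 1.0f and 0xC000 converts to -2.0f.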

Now it’s your turn to build

With those easily accessible tools and easy-to-follow instructions, you should see a substantial boost in performance.
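To compare your own builds against the tokens-per-second figures quoted in this post, you need a consistent way to compute throughput. The sketch below is one simple approach using std::chrono. The names tokens_per_second and measure_tokens_per_second are hypothetical, and the generate callback stands in for whatever inference call your application makes; it is assumed to return the number of tokens it produced.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: throughput for a given token count and elapsed time.
// Returns 0 when no measurable time has elapsed, to avoid dividing by zero.
double tokens_per_second(std::size_t tokens,
                         std::chrono::duration<double> elapsed) {
    if (elapsed.count() <= 0.0) {
        return 0.0;
    }
    return static_cast<double>(tokens) / elapsed.count();
}

// Hypothetical helper: times a callable that runs inference and returns the
// number of tokens it generated (an assumption about your application code).
template <typename GenerateFn>
double measure_tokens_per_second(GenerateFn&& generate) {
    const auto start = std::chrono::steady_clock::now();
    const std::size_t tokens = generate();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return tokens_per_second(tokens, elapsed);
}
```

As a sanity check on the arithmetic, 60 tokens generated in 3 seconds works out to 20 tokens per second, and 9 tokens in 3 seconds works out to 3.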

Explore our new Windows on Snapdragon page. We’ve made it easier to get your hands on the resources to develop for Windows on Snapdragon and to port existing apps. That includes developer tools, full documentation, support options, and our developer blog.



About the Author

Changgeng Zhong, Staff Engineer