Build an NPU-accelerated Edge Chatbot for PCs powered by Snapdragon

TLDR: 
This walkthrough provides everything you need to build your own chat app on a PC powered by Snapdragon using AnythingLLM. We describe the setup and testing process, plus a few recommendations for opportunities to expand on the app’s baseline capabilities.


Even if AI isn’t your usual focus area, it’s worth learning to work with and develop with AI models. Core AI models continue to grow in analytical power and capability, and the AI market is expanding with new, task-specific models.

Knowing the basics of how to stand up, test, and modify an AI-powered application is a way to future-proof your skill set. This exercise is perfect for people who are new to developing with AI, or who are looking for a small, fun project that provides opportunities to experiment and play with a simple edge AI setup.

What You’ll Need

  • Hardware:

    This demonstration was built using the hardware listed below. The app itself is designed to be hardware agnostic, but you may see differences in performance depending on the hardware you choose. Make sure that you have enough RAM to support local inference. AnythingLLM is very lightweight, and it’s possible to use basic features and store chats with as little as 2 GB of RAM.


    • Machine: Dell Latitude 7455
    • Chip: Snapdragon X Elite
    • OS: Windows 11
    • Memory: 32 GB
       
  • Software:

    • Python Version: 3.12.6
    • AnythingLLM LLM Provider: AnythingLLM NPU (for older models, this may be listed as Qualcomm QNN)
    • AnythingLLM Chat Model: Llama 3.1 8B Chat 8K

  • Other resources:

    Check out this GitHub repository for additional resources and code.

Setup

  • Install and set up AnythingLLM. Make sure you choose AnythingLLM NPU when prompted to choose an LLM provider to target the NPU. Choose a model; we used Llama 3.1 8B Chat with 8K context, but you may get better performance from other models depending on the constraints of your hardware.

  • Create a workspace by clicking "+ New Workspace"

  • Generate an API key
  1. Click the settings button on the bottom of the left panel
  2. Open the "Tools" dropdown
  3. Click "Developer API"
  4. Click "Generate New API Key"

Open a PowerShell instance and clone the repo

git clone https://github.com/thatrandomfrenchdude/simple-npu-chatbot.git

Create and activate your virtual environment, then install the requirements

# 1. navigate to the cloned directory
cd simple-npu-chatbot

# 2. create the python virtual environment
python -m venv llm-venv

# 3. activate the virtual environment (run only the line for your OS)
./llm-venv/Scripts/Activate.ps1     # windows (PowerShell)
source llm-venv/bin/activate        # mac/linux

# 4. install the requirements
pip install -r requirements.txt

Create your config.yaml file with the following variables

api_key: "your-key-here"
model_server_base_url: "http://localhost:3001/api/v1"
workspace_slug: "your-slug-here"
stream: true
stream_timeout: 60
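
A missing or misspelled key in config.yaml only shows up later as a KeyError, so it can help to check the file up front. The sketch below is a hypothetical helper (check_config is not part of the repository) that takes the dictionary you would get from yaml.safe_load and reports any problems. The list of required keys simply mirrors the variables above.

```python
# Hypothetical helper, not part of the repository: validates the dictionary
# produced by yaml.safe_load() on config.yaml.
REQUIRED_KEYS = {
    "api_key": str,
    "model_server_base_url": str,
    "workspace_slug": str,
    "stream": bool,
    "stream_timeout": (int, float),
}


def check_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"wrong type for {key}: {type(config[key]).__name__}")
    return problems


if __name__ == "__main__":
    example = {
        "api_key": "your-key-here",
        "model_server_base_url": "http://localhost:3001/api/v1",
        "workspace_slug": "your-slug-here",
        "stream": True,
        "stream_timeout": 60,
    }
    print(check_config(example))  # an empty list means every key is present
```

Calling check_config right after loading the file lets you print a clear message before any request is made.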

Test the model server auth to verify the API key

python src/auth.py
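
For a sense of what such a check involves, here is a minimal sketch using only the standard library. It assumes the model server exposes an auth endpoint at {base_url}/auth that accepts the same Bearer token header used later in this post. The function names are hypothetical, and the repository's auth.py may be implemented differently.

```python
import urllib.error
import urllib.request


def build_auth_request(base_url, api_key):
    # Assumption: the auth check lives at <base_url>/auth.
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/auth",
        headers={
            "accept": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="GET",
    )


def is_authenticated(base_url, api_key, timeout=5):
    """Return True if the server answers the auth request with HTTP 200."""
    request = build_auth_request(base_url, api_key)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx responses.
        return False
```

If is_authenticated returns False, check that AnythingLLM is running and that the key in config.yaml matches the one you generated.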

Get your workspace slug using the workspaces tool

  1. Run python src/workspaces.py in your command line console
  2. Find your workspace and its slug from the output
  3. Add the slug to the workspace_slug variable in config.yaml
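
If you would rather pick the slug out programmatically, the sketch below shows one way to do it once you have the JSON body of a workspaces listing. The response shape ({"workspaces": [{"name": ..., "slug": ...}]}) is an assumption, and both function names are hypothetical; compare against the actual output of src/workspaces.py before relying on it.

```python
def list_workspaces(payload):
    """Return (name, slug) pairs from a workspaces response body."""
    # Assumption: the body looks like {"workspaces": [{"name": ..., "slug": ...}]}.
    return [(w.get("name"), w.get("slug")) for w in payload.get("workspaces", [])]


def find_slug(payload, workspace_name):
    """Return the slug for a workspace name (case-insensitive), or None."""
    for name, slug in list_workspaces(payload):
        if name and name.lower() == workspace_name.lower():
            return slug
    return None
```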

Building the App

With the application configured, it is worth reviewing the code in depth so that you are well equipped to extend it for your own application. As always with code, there are many ways to do the same thing, so don’t take this as the only way to build a chatbot application.

In addition to the auth.py and workspaces.py utilities mentioned above, this code contains the option to use either the terminal or a Gradio interface to talk with the chatbot. A terminal is quicker to set up and very lightweight (especially useful if you’re experimenting with tight device constraints), but the user interface is limited. If you haven’t used a terminal before, you might find it a bit counterintuitive.

If you choose Gradio, you’ll need to have access to a web browser; that means you’ll be using slightly more system resources than you would with the terminal, but you’ll enjoy a more intuitive user interface.

In addition, both interfaces include a blocking and a streaming version, selected by the stream Boolean in config.yaml. Blocking waits until the response is fully processed before returning it, while streaming returns chunks as they become available.

This code review covers only the terminal interface, since the Gradio version works in much the same way. Please note that the code is simplified compared to the GitHub repository for brevity; types and comments are removed.

Terminal Chatbot

To start with the terminal version, you will need the following libraries installed and imported:

import asyncio
import httpx
import json
import requests
import sys
import threading
import time
import yaml

asyncio and httpx handle the asynchronous streaming requests to the model server, and requests handles the blocking requests. json and yaml parse the responses and the config file, respectively, and sys, threading, and time drive the loading indicator shown while a blocking response is processing.

The loading_indicator function is simple: every half second it prints a period in the command line, up to 10, before erasing the line and starting again. This is run in a thread while the model response is processing, and the code looks like this:

def loading_indicator():
    while not stop_loading:
        for _ in range(10):
            sys.stdout.write('.')
            sys.stdout.flush()
            time.sleep(0.5)
        sys.stdout.write('\r' + ' ' * 10 + '\r')
        sys.stdout.flush()
    print('')
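
The function above relies on a module-level stop_loading flag and only re-checks it after a full row of dots, so the thread can linger for a few seconds after the response arrives. One alternative, sketched here as an assumption rather than the repository's approach, is to pass in a threading.Event and wait on it between dots. The out, interval, and width parameters are additions for illustration.

```python
import sys
import threading
import time


def loading_indicator(stop_event, out=sys.stdout, interval=0.5, width=10):
    """Print dots until stop_event is set, then clear the line."""
    while not stop_event.is_set():
        for _ in range(width):
            out.write(".")
            out.flush()
            # wait() returns True as soon as the event is set, so the
            # thread stops promptly instead of sleeping a full interval.
            if stop_event.wait(interval):
                break
        # Erase the dots and return the cursor to the start of the line.
        out.write("\r" + " " * width + "\r")
        out.flush()


if __name__ == "__main__":
    stop = threading.Event()
    worker = threading.Thread(target=loading_indicator, args=(stop,))
    worker.start()
    time.sleep(1.2)  # stand-in for the blocking model request
    stop.set()
    worker.join()
```

Because the event is passed as an argument, no global statement is needed in the code that starts and stops the indicator.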

The rest of the code, with the exception of the invocation discussed at the end, exists in the Chatbot class. The initialization is straightforward: the config file is read, and class variables are assigned for reference in the other functions:

class Chatbot:
    def __init__(self):
        with open("config.yaml", "r") as file:
            config = yaml.safe_load(file)

        self.api_key = config["api_key"]
        self.base_url = config["model_server_base_url"]
        self.stream = config["stream"]
        self.stream_timeout = config["stream_timeout"]
        self.workspace_slug = config["workspace_slug"]

        if self.stream:
            self.chat_url = f"{self.base_url}/workspace/{self.workspace_slug}/stream-chat"
        else:
            self.chat_url = f"{self.base_url}/workspace/{self.workspace_slug}/chat"

        self.headers = {
            "accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": "Bearer " + self.api_key
        }

Take special note of the last few lines of code. A check on the stream setting selects the matching chat endpoint, and the request headers are defined once for easy reuse with each request.
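
If you plan to extend the class, it can be convenient to pull the endpoint and header logic into small functions that are easy to test in isolation. The sketch below restates the logic from the initializer; build_chat_url and build_headers are hypothetical names, not functions from the repository.

```python
def build_chat_url(base_url, workspace_slug, stream):
    """Return the chat endpoint that matches the stream setting."""
    endpoint = "stream-chat" if stream else "chat"
    return f"{base_url.rstrip('/')}/workspace/{workspace_slug}/{endpoint}"


def build_headers(api_key):
    """Return the headers sent with every chat request."""
    return {
        "accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
```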

Following the initialization is the run function, which can be considered the core function of any chatbot:

    def run(self):
        while True:
            user_message = input("You: ")
            if user_message.lower() in [
                "exit",
                "quit",
                "bye",
            ]:
                break
            print("")
            try:
                if self.stream:
                    self.streaming_chat(user_message)
                else:
                    self.blocking_chat(user_message)
            except Exception as e:
                print("Error! Check the model is correctly loaded. More details in README troubleshooting section.")
                sys.exit(f"Error details: {e}")

On each loop, the function first requests input from the user. The input is checked to know whether the user wants to break out of the chat using special keywords assigned for this purpose. Finally, a response is provided. Wash, rinse, repeat ad infinitum.

Custom, hardcoded functionality can be included by either adding additional keywords or inserting additional code around or between the user input and chatbot response. Agents, a specialized implementation of chatbots capable of independent actions, may include extensive processing before responding to the user. For a simple agent implementation, check out this GitHub repository.
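
As a rough illustration of adding keywords, the sketch below routes a few commands to local handlers before anything is sent to the model. It is a standalone variant of the loop, not the repository's code: the /help command, the handle_local_command helper, and the send_to_model, read_input, and write parameters are all assumptions made so that the example runs on its own.

```python
EXIT_WORDS = {"exit", "quit", "bye"}


def handle_local_command(user_message, commands):
    """Return a reply string if the message is a local command, else None."""
    handler = commands.get(user_message.strip().lower())
    return handler() if handler else None


def run(send_to_model, read_input=input, write=print):
    # Hypothetical local commands; add your own keywords here.
    commands = {
        "/help": lambda: "Type a message to chat, or exit/quit/bye to leave.",
    }
    while True:
        user_message = read_input("You: ")
        if user_message.strip().lower() in EXIT_WORDS:
            break
        local_reply = handle_local_command(user_message, commands)
        if local_reply is not None:
            write(local_reply)  # answered locally, no model call
            continue
        write(send_to_model(user_message))
```

Passing the input and output functions as parameters keeps the loop testable without a live model server.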

Now let’s dive into the blocking chat functionality.

The first four lines of code deal with the thread for the loading indicator. The stop_loading global variable read by loading_indicator is declared, set to False since we want the loading animation to start, and then a thread is launched with the function. That looks like this:

        global stop_loading
        stop_loading = False
        loading_thread = threading.Thread(target=loading_indicator)
        loading_thread.start()

Next, data and chat_response are defined. data contains a JSON dictionary that is passed to the API when the request is made. The sessionId key in the dictionary is not used by AnythingLLM; it is there so that you, as the developer, can keep track of sessions if you want to. The attachments key is included because AnythingLLM can accept and parse attachments with a chat completion request. chat_response contains the result of the request, made using the requests library in Python.

        data = {
            "message": message,
            "mode": "chat",
            "sessionId": "example-session-id",
            "attachments": []
        }

        chat_response = requests.post(
            self.chat_url,
            headers=self.headers,
            json=data
        )

The last section of the function handles the agent's response. First, stop_loading is set to True and the loading indicator thread is joined back into the main thread. Then, the function attempts to print the message while checking for errors. As with the run function, this can be extended with custom functionality. For example, though AnythingLLM already tracks chat history for context, you may want to track it separately within your application for additional context with each request.

        stop_loading = True
        loading_thread.join()

        try:
            print("Agent: ", end="")
            print(chat_response.json()['textResponse'])
            print("")
        except ValueError:
            return "Response is not valid JSON"
        except Exception as e:
            return f"Chat request failed. Error: {e}"
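
Tracking history on the application side can be as simple as a bounded list of turns. The ChatHistory class below is a hypothetical sketch, not part of the repository; it keeps the most recent turns and can render them as text that you might prepend to a message.

```python
from collections import deque


class ChatHistory:
    """Keep the most recent chat turns in memory."""

    def __init__(self, max_turns=20):
        # deque with maxlen silently drops the oldest turn when full.
        self._turns = deque(maxlen=max_turns)

    def add(self, role, text):
        self._turns.append({"role": role, "text": text})

    def as_prompt_prefix(self):
        """Render the stored turns as one 'role: text' line each."""
        return "\n".join(f"{t['role']}: {t['text']}" for t in self._turns)
```

One way to use it is to call add("user", message) before the request and add("agent", reply) after it.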

Altogether, the blocking_chat function looks like this, and takes the message to be sent as an argument:

    def blocking_chat(self, message):
        global stop_loading
        stop_loading = False
        loading_thread = threading.Thread(target=loading_indicator)
        loading_thread.start()

        data = {
            "message": message,
            "mode": "chat",
            "sessionId": "example-session-id",
            "attachments": []
        }

        chat_response = requests.post(
            self.chat_url,
            headers=self.headers,
            json=data
        )

        stop_loading = True
        loading_thread.join()

        try:
            print("Agent: ", end="")
            print(chat_response.json()['textResponse'])
            print("")
        except ValueError:
            return "Response is not valid JSON"
        except Exception as e:
            return f"Chat request failed. Error: {e}"

Now let’s look at how streaming compares. It consists of two functions: a synchronous wrapper, streaming_chat, and an async function, streaming_chat_async, that handles the chunks as they arrive. The wrapper is a simple function that uses asyncio to run the async function. It is a single line in addition to the definition line and looks like this:

    def streaming_chat(self, message):
        asyncio.run(self.streaming_chat_async(message))
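
To see the wrapper pattern in isolation, the sketch below replaces the HTTP stream with a fake async generator. Everything in it (fake_stream, consume_async, consume) is a hypothetical stand-in; only the use of asyncio.run as the synchronous entry point mirrors the code above.

```python
import asyncio


async def fake_stream(chunks, delay=0.01):
    """Stand-in for a streaming HTTP response: yields chunks over time."""
    for chunk in chunks:
        await asyncio.sleep(delay)  # simulates waiting on the network
        yield chunk


async def consume_async(chunks):
    """Collect chunks as they arrive and return the joined text."""
    received = []
    async for chunk in fake_stream(chunks):
        received.append(chunk)
    return "".join(received)


def consume(chunks):
    # Synchronous entry point, the same shape as streaming_chat above.
    return asyncio.run(consume_async(chunks))
```

Note that asyncio.run starts a new event loop each time it is called, so it should only be called from code that is not already running inside one.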

The main function called by the wrapper begins with a couple of variable definitions. The data variable remains the same. The buffer below it is used to collect the chunks as they arrive.

        data = {
            "message": message,
            "mode": "chat",
            "sessionId": "example-session-id",
            "attachments": []
        }

        buffer = ""

The next section is long, so it is split into smaller pieces, starting with some nested contexts and loops:

        try:
            async with httpx.AsyncClient(timeout=self.stream_timeout) as client:
                async with client.stream("POST", self.chat_url, headers=self.headers, json=data) as response:
                    print("Agent: ", end="")
                    async for chunk in response.aiter_text():
                        if chunk:
                            buffer += chunk

The first context sets up an asynchronous HTTP client to handle the stream, followed by a second context that uses that client to make the actual POST request with the message to the model endpoint. An error check, not included here, catches any error with the httpx client. The print statement following these contexts prints the initial label to which the chatbot response will be appended.

Directly following are a couple of lines handling chunks from the response. Like the request itself, these chunks are iterated asynchronously, which releases execution back to the event loop whenever a chunk is not yet ready to be processed. Once chunks are received, they are placed into the buffer.

The next couple lines are used to begin processing the data in the buffer:

                            while "\n" in buffer:
                                line, buffer = buffer.split("\n", 1)
                                if line.startswith("data: "):
                                    line = line[len("data: "):]

While there are still full lines in the buffer, the buffer processing loop runs to print each one to the screen. If a full line is not detected (in this case, by looking for the newline character), the loop exits and waits for more chunks.

To process a line, it is extracted from the buffer, and the buffer is updated to remove the line now being processed. From there, the line is cleaned. This is necessary because each line of the stream arrives prefixed with "data: ", so the prefix is removed to leave only the JSON payload.

Next, the extracted line is processed:

                                try:
                                    parsed_chunk = json.loads(line.strip())
                                    print(parsed_chunk.get("textResponse", ""), end="", flush=True)

                                    if parsed_chunk.get("close", False):
                                        print("")
                                except json.JSONDecodeError:
                                    # Not a JSON payload (for example, a blank separator line); skip it.
                                    continue
                                except Exception as e:
                                    # generic error handling, quit for debug
                                    print(f"Error processing chunk: {e}")
                                    sys.exit()

Within the try block, the parsed chunk is loaded as json and printed to the console. The last section checks to see if this is the final message; if so, a new line is printed to provide separation between the user's input and the chatbot’s output. Some light error processing – one for json decoding and one generic catchall – handles any unexpected issues.
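
The buffering and parsing steps can also be factored into a pure function, which makes them easy to test without a model server. The sketch below is a hypothetical refactor (drain_buffer is not a function from the repository) that assumes the same "data: " prefix and JSON payloads described above.

```python
import json


def drain_buffer(buffer):
    """Split complete lines off the buffer.

    Returns (events, remainder): events is a list of parsed JSON objects,
    remainder is the trailing partial line to keep for the next chunk.
    """
    events = []
    while "\n" in buffer:
        line, buffer = buffer.split("\n", 1)
        line = line.strip()
        if line.startswith("data:"):
            line = line[len("data:"):].strip()
        if not line:
            continue  # blank separator line between events
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON payload
    return events, buffer
```

A caller would append each incoming chunk to the remainder, call drain_buffer again, and print the textResponse of every returned event.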

Altogether, the streaming_chat functions look like this, and each takes the message to be sent as an argument:

    def streaming_chat(self, message):
        asyncio.run(self.streaming_chat_async(message))

    async def streaming_chat_async(self, message):
        data = {
            "message": message,
            "mode": "chat",
            "sessionId": "example-session-id",
            "attachments": []
        }

        buffer = ""
        try:
            async with httpx.AsyncClient(timeout=self.stream_timeout) as client:
                async with client.stream("POST", self.chat_url, headers=self.headers, json=data) as response:
                    print("Agent: ", end="")
                    async for chunk in response.aiter_text():
                        if chunk:
                            buffer += chunk
                            while "\n" in buffer:
                                line, buffer = buffer.split("\n", 1)
                                if line.startswith("data: "):
                                    line = line[len("data: "):]
                                try:
                                    parsed_chunk = json.loads(line.strip())
                                    print(parsed_chunk.get("textResponse", ""), end="", flush=True)

                                    if parsed_chunk.get("close", False):
                                        print("")
                                except json.JSONDecodeError:
                                    # Not a JSON payload (for example, a blank separator line); skip it.
                                    continue
                                except Exception as e:
                                    # generic error handling, quit for debug
                                    print(f"Error processing chunk: {e}")
                                    sys.exit()
        except httpx.RequestError as e:
            print(f"Streaming chat request failed. Error: {e}")

Finally, the chatbot is instantiated and called with a main function:

if __name__ == '__main__':
    stop_loading = False
    chatbot = Chatbot()
    chatbot.run()

The first line of this block tells the Python interpreter to run this code if the file is called as the main file. The second sets a global variable used to tell the loading indicator thread whether it should be displaying the loading bar. The last two initialize the Chatbot class and run it.

Follow the "From Zero to Chatbot: 30-Minute Build Challenge" build-along video for more comprehensive, step-by-step instructions.

Test your Chat App

As described in the previous section, you have the option to use a terminal or Gradio chat interface to talk with the bot. After completing setup, run the app you choose from the command line:

# terminal
python src/terminal_chatbot.py

# gradio
python src/gradio_chatbot.py

Troubleshooting Common Problems

AnythingLLM NPU Runtime Missing

On a machine with Snapdragon X Elite, AnythingLLM NPU should be the default LLM provider. If you do not see it in the dropdown, you likely downloaded the x64 version of AnythingLLM. Delete the app and install the ARM64 version instead.

Model Not Downloaded

Sometimes the selected model fails to download, causing an error in the generation. To resolve, check the model in Settings -> AI Providers -> LLM in AnythingLLM. You should see "uninstall" on the model card if it is installed correctly. If you see "model requires download," choose another model, click save, switch back, then save. You should see the model download in the upper right corner of the AnythingLLM window.

Optional Add-on: Create a Loading Bar

To implement a multi-threaded loading bar for inference tasks in AnythingLLM, you’ll need to use Python's threading and queue modules.

1. Core Components

  • Main Thread: Handles UI updates and progress bar rendering
  • Worker Thread: Executes the inference task
  • Progress Queue: Thread-safe communication channel between threads

import threading
import queue
import time
from alive_progress import alive_bar  # Third-party: pip install alive-progress

2. Thread-Safe Progress Tracking

def inference_task(progress_queue, query):
    # Simulate inference processing
    steps = ["Tokenizing", "Processing", "Generating"]
    for i, step in enumerate(steps, 1):
        time.sleep(1)  # Replace with actual inference work
        progress_queue.put((i/len(steps)*100, step))
    progress_queue.put((100, "Complete"))

3. Loading Bar Thread

def loading_bar(progress_queue):
    with alive_bar(100, title='Processing Query') as bar:
        while True:
            try:
                progress, status = progress_queue.get(timeout=0.1)
                bar.title(f'[{status}]')
                bar(progress - bar.current)  # Increment by delta
                if progress >= 100:
                    break
            except queue.Empty:
                continue

4. AnythingLLM Integration

def run_inference_with_progress(query):
    progress_queue = queue.Queue()
    
    # Start inference thread
    inference_thread = threading.Thread(
        target=inference_task,
        args=(progress_queue, query)
    )
    inference_thread.start()
    
    # Start progress bar in main thread
    loading_bar(progress_queue)
    
    inference_thread.join()
    print("\nInference complete")

Implementation Notes

Thread Synchronization

  • Use queue.Queue for safe inter-thread communication
  • Main thread polls queue every 100ms for updates
  • Worker sends percentage complete and status messages
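
If you want to try the queue pattern without installing alive-progress, the sketch below uses only the standard library. It is an illustration under assumptions: the worker does no real inference, a None sentinel (an addition not used in the code above) marks the end of the updates, and collect_progress returns the updates instead of drawing a bar.

```python
import queue
import threading


def worker(progress_queue, steps):
    """Report a (percent, status) tuple after each step."""
    for i, step in enumerate(steps, 1):
        # Real work would go here.
        progress_queue.put((round(i / len(steps) * 100), step))
    progress_queue.put(None)  # sentinel: no more updates


def collect_progress(steps, poll_seconds=0.1):
    """Run the worker in a thread and gather its updates on the caller's thread."""
    progress_queue = queue.Queue()
    thread = threading.Thread(target=worker, args=(progress_queue, steps))
    thread.start()
    updates = []
    while True:
        try:
            item = progress_queue.get(timeout=poll_seconds)
        except queue.Empty:
            continue  # nothing yet; a UI could redraw a spinner here
        if item is None:
            break
        updates.append(item)
    thread.join()
    return updates
```

Using a sentinel instead of checking for 100 percent means the loop still ends cleanly if the worker stops early.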

     

UI Integration

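# For web UI integration (Flask example; requires: pip install flask)
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/chat', methods=['POST'])
def chat():
    query = request.json['query']
    # Run inference in the background so the request returns immediately
    thread = threading.Thread(target=run_inference_with_progress, args=(query,))
    thread.start()
    return jsonify({"status": "Processing started"})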

Advanced Features

  • Use alive-progress for animated spinners and throughput stats
  • Add ETA calculations: 
bar.title(f'ETA: {bar.eta}s | {status}')

Architecture Diagram

[User Input]
    │
    ▼
[Main Thread] ─── starts ───▶ [Worker Thread]
    │  ▲                       │
    │  └── progress updates ───┘
    ▼
[Loading Bar Render]
    │
    ▼
[LLM Response]

This pattern maintains UI responsiveness while showing real-time inference progress.

Looking for another extension of this basic project? Try using a text stream for asynchronous communication!

Additional Resources

Are you new to programming? Welcome! Learning how to code will help give you useful skills that transfer across a lot of different domains.

Coding teaches you how to break down problems into simpler tasks, and how to identify recurring elements of those tasks—often these are things that can be automated, or, at a minimum, you can reuse code you write to execute that task rather than starting from scratch each time.

There are lots of great programming courses for beginners that you might find helpful:

  • LearnPython.org has free tutorials, including beginner and advanced options.
  • If you learn better when you have a specific problem to solve, you might enjoy Udemy’s project-based offerings. These aren’t free, but Udemy has frequent sales, and you can often find coupon codes.
  • If you’re an absolute beginner, Free Code Camp’s free Python tutorial will get you up and running with the basics in about 4.5 hours.

We’d love to see what you built! Join us on Qualcomm Developer Discord to show off your project! Or, check out Qualcomm AI Hub for more information about using AI models and tools on devices with Qualcomm technology.

FAQ

Q: What is Edge AI and why use Snapdragon NPU?

A: Edge AI refers to running machine learning models directly on local devices (like hardware powered by Snapdragon), reducing latency, minimizing data transfer, and improving privacy.

The Snapdragon NPU accelerates on-device inference, making chat and AI apps run faster and more responsively without relying on the cloud.

Q: What is AnythingLLM and how does it integrate with Snapdragon?

A: AnythingLLM is a lightweight, provider-agnostic framework for running large language models. In this walkthrough, we choose “AnythingLLM NPU” (which may be listed as “Qualcomm QNN”) as the LLM provider for NPU acceleration. This enables local inference with models like Llama 3.1 8B using Snapdragon’s accelerator hardware.

Q: How can I improve UI responsiveness during inference?

A: Implement asynchronous inference with a worker thread and communicate progress via a queue.Queue to the main thread. Using tools like alive-progress, you can render a loading bar (or Gradio UI) while the NPU processes data in the background, keeping the interface fluid.

Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

About the Author
Nick Debeurre, Senior Product Manager and AI Developer Advocate