Build an NPU-accelerated Edge Chatbot for PCs powered by Snapdragon
TLDR:
This walkthrough provides everything you need to build your own chat app on a PC powered by Snapdragon using AnythingLLM. We describe the setup and testing process, plus a few recommendations for opportunities to expand on the app’s baseline capabilities.
Even if AI isn’t your usual focus area, it’s worth learning to work with and develop with AI models. Core AI models continue to grow in analytical power and capability, and the AI market is expanding with new, task-specific models.
Knowing the basics of how to stand up, test, and modify an AI-powered application is a way to future-proof your skill set. This exercise is perfect for people who are new to developing with AI, or who are looking for a small, fun project that provides opportunities to experiment and play with a simple edge AI setup.
What You’ll Need
- Hardware:
This demonstration was built using the following hardware. The app itself is designed to be hardware agnostic, but you may see differences in performance depending on what hardware you choose. Make sure that you have enough RAM to support local inference. AnythingLLM is very lightweight, and it’s possible to use basic features and store chats with as little as 2GB of RAM.
- Machine: Dell Latitude 7455
- Chip: Snapdragon X Elite
- OS: Windows 11
- Memory: 32 GB
- Software:
- Python Version: 3.12.6
- AnythingLLM LLM Provider: AnythingLLM NPU (for older models, this may be listed as Qualcomm QNN)
- AnythingLLM Chat Model: Llama 3.1 8B Chat 8K
- Other resources:
Check out this GitHub repository for additional resources and code.
Setup
- Install and set up AnythingLLM. Make sure you choose AnythingLLM NPU when prompted to choose an LLM provider to target the NPU. Choose a model; we used Llama 3.1 8B Chat with 8K context, but you may get better performance from other models depending on the constraints of your hardware.
- Create a workspace by clicking "+ New Workspace"
- Generate an API key
- Click the settings button on the bottom of the left panel
- Open the "Tools" dropdown
- Click "Developer API"
- Click "Generate New API Key"
Open a PowerShell instance and clone the repo
git clone https://github.com/thatrandomfrenchdude/simple-npu-chatbot.git
Create and activate your virtual environment, then install the requirements
# 1. navigate to the cloned directory
cd simple-npu-chatbot
# 2. create the python virtual environment
python -m venv llm-venv
# 3. activate the virtual environment
./llm-venv/Scripts/Activate.ps1 # windows (PowerShell)
source llm-venv/bin/activate # mac/linux (use this instead of the line above)
# 4. install the requirements
pip install -r requirements.txt
Create your config.yaml file with the following variables
api_key: "your-key-here"
model_server_base_url: "http://localhost:3001/api/v1"
workspace_slug: "your-slug-here"
stream: true
stream_timeout: 60
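Before running the tools, it can help to sanity-check the values you entered. The sketch below is not part of the repository: `validate_config` and `REQUIRED_KEYS` are hypothetical names, and the function works on a plain dictionary such as the one `yaml.safe_load` returns for the file above.

```python
# Hypothetical helper, not part of the repository: checks a config dictionary
# (for example, the result of yaml.safe_load on config.yaml) for the five
# keys shown above and for obviously wrong types.
REQUIRED_KEYS = {
    "api_key": str,
    "model_server_base_url": str,
    "workspace_slug": str,
    "stream": bool,
    "stream_timeout": (int, float),
}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"wrong type for {key}: {type(config[key]).__name__}")
    # Placeholder values copied from the template are easy to forget.
    for key in ("api_key", "workspace_slug"):
        if isinstance(config.get(key), str) and config[key].startswith("your-"):
            problems.append(f"placeholder value left in {key}")
    return problems


example = {
    "api_key": "your-key-here",
    "model_server_base_url": "http://localhost:3001/api/v1",
    "workspace_slug": "demo",
    "stream": True,
    "stream_timeout": 60,
}
for problem in validate_config(example):
    print(problem)
```

Note that at this point in the setup the workspace slug is still a placeholder, so a check like this is most useful after the slug has been filled in.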
Test the model server auth to verify the API key
python src/auth.py
Get your workspace slug using the workspaces tool
- Run python src/workspaces.py in your command line console
- Find your workspace and its slug from the output
- Add the slug to the workspace_slug variable in config.yaml
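If you have several workspaces, picking the slug out by hand can be error prone. The following is a minimal sketch of looking it up by name. It assumes the workspace listing is a dictionary shaped like `{"workspaces": [{"name": ..., "slug": ...}]}`; that shape, the sample data, and the `find_slug` name are assumptions for illustration, so compare against the real output of src/workspaces.py first.

```python
# Hypothetical helper: pick a workspace slug out of a workspace listing.
# ASSUMPTION: the listing is a dictionary shaped like
# {"workspaces": [{"name": "...", "slug": "..."}, ...]}; check the real
# output of src/workspaces.py before relying on this shape.
def find_slug(listing, workspace_name):
    """Return the slug for the named workspace, or None if it is not listed."""
    for workspace in listing.get("workspaces", []):
        if workspace.get("name", "").lower() == workspace_name.lower():
            return workspace.get("slug")
    return None


# Made-up sample data for illustration only.
sample_listing = {
    "workspaces": [
        {"name": "My Chatbot", "slug": "my-chatbot"},
        {"name": "Notes", "slug": "notes"},
    ]
}
print(find_slug(sample_listing, "my chatbot"))
```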
Building the App
With the application configured, it is worth reviewing the code in depth so that you are well equipped to extend it for your own application. As with any code, there are many ways to do the same thing, so don’t take this as the only way to build a chatbot application.
In addition to the auth.py and workspaces.py utilities mentioned above, this code offers the option to use either the terminal or a Gradio interface to talk with the chatbot. A terminal is quicker to set up and very lightweight, which is especially useful if you’re experimenting with tight device constraints, but the user interface is limited. If you haven’t used a terminal before, you might find it a bit counterintuitive.
If you choose Gradio, you’ll need access to a web browser; that means you’ll use slightly more system resources than you would with the terminal, but you’ll enjoy a more intuitive user interface.
In addition, both interfaces include a blocking and a streaming version, selected by the stream Boolean in config.yaml. Blocking waits until the response is fully processed before returning it, while streaming returns chunks as they become available.
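To make the difference between the two modes concrete, here is a small sketch that does not talk to AnythingLLM at all. `fake_model` is a made-up stand-in that produces a canned reply one word at a time; the two functions around it show what "blocking" and "streaming" mean from the caller's point of view.

```python
import time


def fake_model(prompt):
    """Stand-in for a model server: yields a canned reply one word at a time."""
    for word in ("Hello", "from", "the", "edge"):
        time.sleep(0.01)  # pretend each word takes time to generate
        yield word


def blocking_reply(prompt):
    """Wait for every piece, then return the whole reply at once."""
    return " ".join(fake_model(prompt))


def streaming_reply(prompt):
    """Hand each piece to the caller as soon as it is available."""
    for word in fake_model(prompt):
        yield word


print(blocking_reply("hi"))          # appears only after all words exist
for piece in streaming_reply("hi"):  # appears word by word
    print(piece, end=" ", flush=True)
print()
```

The total time is the same in both cases; streaming only changes when the user starts seeing output.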
This code review will only cover the terminal interface since the functionality is similar. Please note, the code is simplified as compared to the GitHub repository for brevity; types and comments are removed.
Terminal Chatbot
To start with the terminal version, you will need the following libraries installed and imported:
import asyncio
import httpx
import json
import requests
import sys
import threading
import time
import yaml
The asyncio and httpx libraries handle the asynchronous streaming requests to the model server, and requests handles the blocking ones. The json and yaml libraries are used to parse the responses and the config file, respectively, and sys, threading, and time drive the loading indicator shown while a blocking response is processing.
The loading_indicator function is simple: every half second it prints a period in the command line, up to 10, before erasing the line and starting again. This is run in a thread while the model response is processing, and the code looks like this:
def loading_indicator():
    while not stop_loading:
        for _ in range(10):
            sys.stdout.write('.')
            sys.stdout.flush()
            time.sleep(0.5)
        sys.stdout.write('\r' + ' ' * 10 + '\r')
        sys.stdout.flush()
    print('')
The rest of the code, with the exception of the invocation discussed at the end, exists in the Chatbot class. The initialization is straightforward: the config file is read, and class variables are assigned for reference in the other functions:
class Chatbot:
    def __init__(self):
        with open("config.yaml", "r") as file:
            config = yaml.safe_load(file)
        self.api_key = config["api_key"]
        self.base_url = config["model_server_base_url"]
        self.stream = config["stream"]
        self.stream_timeout = config["stream_timeout"]
        self.workspace_slug = config["workspace_slug"]
        if self.stream:
            self.chat_url = f"{self.base_url}/workspace/{self.workspace_slug}/stream-chat"
        else:
            self.chat_url = f"{self.base_url}/workspace/{self.workspace_slug}/chat"
        self.headers = {
            "accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": "Bearer " + self.api_key
        }
Take special note of the last few lines of code. There is a check for whether the user wants to stream, which selects the matching endpoint, and the request headers are defined once for easy reuse with each request.
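The endpoint choice and the headers can also be pulled into two small functions, which makes them easy to check without a running server. This is a sketch rather than the repository's code: `build_chat_url` and `build_headers` are hypothetical names, and stripping a trailing slash from the base URL is an addition not present in the original.

```python
def build_chat_url(base_url, workspace_slug, stream):
    """Mirror the endpoint choice made in __init__: stream-chat or chat."""
    endpoint = "stream-chat" if stream else "chat"
    # rstrip is an extra safeguard (not in the original) against a base URL
    # that was entered with a trailing slash.
    return f"{base_url.rstrip('/')}/workspace/{workspace_slug}/{endpoint}"


def build_headers(api_key):
    """Build the same three headers that __init__ stores for reuse."""
    return {
        "accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }


print(build_chat_url("http://localhost:3001/api/v1", "demo", stream=True))
print(build_chat_url("http://localhost:3001/api/v1/", "demo", stream=False))
```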
Following the initialization is the run function, which can be considered as the core function of all chatbots:
    def run(self):
        while True:
            user_message = input("You: ")
            if user_message.lower() in [
                "exit",
                "quit",
                "bye",
            ]:
                break
            print("")
            try:
                self.streaming_chat(user_message) if self.stream \
                    else self.blocking_chat(user_message)
            except Exception as e:
                print("Error! Check the model is correctly loaded. More details in README troubleshooting section.")
                sys.exit(f"Error details: {e}")
On each loop, the function first requests input from the user. The input is checked to know whether the user wants to break out of the chat using special keywords assigned for this purpose. Finally, a response is provided. Wash, rinse, repeat ad infinitum.
Custom, hardcoded functionality can be included by either adding additional keywords or inserting additional code around or between the user input and chatbot response. Agents, a specialized implementation of chatbots capable of independent actions, may include extensive processing before responding to the user. For a simple agent implementation, check out this GitHub repository.
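As one way to add keywords, the input can be routed before it ever reaches the model. The sketch below is an illustration, not the repository's code: the `/help` and `/time` commands, `LOCAL_COMMANDS`, and `route_message` are all hypothetical. It only classifies a line of input; the run loop would still decide what to do with the result.

```python
import datetime

EXIT_WORDS = {"exit", "quit", "bye"}


def show_help():
    return "Commands: /help, /time, or exit, quit, bye to leave."


def show_time():
    return datetime.datetime.now().strftime("%H:%M")


# Hypothetical extra keywords that are answered locally, without the model.
LOCAL_COMMANDS = {"/help": show_help, "/time": show_time}


def route_message(user_message):
    """Classify one line of input as 'exit', 'local', or 'model'."""
    text = user_message.strip().lower()
    if text in EXIT_WORDS:
        return ("exit", None)
    if text in LOCAL_COMMANDS:
        return ("local", LOCAL_COMMANDS[text]())
    return ("model", user_message)


for line in ["/help", "What is an NPU?", "bye"]:
    print(route_message(line))
```

Inside a loop like run, an "exit" result would break, a "local" result would be printed directly, and a "model" result would be passed on to the chat function.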
Now let’s dive into the blocking chat functionality.
The first four lines of code deal with the thread for the loading bar. The stop_loading global variable mentioned earlier is referenced, set to False since we want the loading to start, and then a thread is launched with the function. That looks like this:
global stop_loading
stop_loading = False
loading_thread = threading.Thread(target=loading_indicator)
loading_thread.start()
Next, data and chat_response are defined. The data variable holds a JSON-style dictionary that is passed to the API when the request is made. The sessionId key in the dictionary is not used by AnythingLLM; it is there to keep track of the session(s) if you want to do that as the developer. The attachments key is included because AnythingLLM can accept and parse attachments with a chat completion request. The chat_response variable holds the result of the request, which is made with the widely used requests library for Python.
data = {
    "message": message,
    "mode": "chat",
    "sessionId": "example-session-id",
    "attachments": []
}
chat_response = requests.post(
    self.chat_url,
    headers=self.headers,
    json=data
)
The last section of the function handles the agent's response. First, the stop_loading variable is triggered, and the loading bar thread is joined back into the main thread. Then, the function attempts to print the message while checking for errors. As with the run function, this can be extended with custom functionality. For example, though AnythingLLM already tracks chat history for context, you may want to track it separately within your application for additional context with each request.
stop_loading = True
loading_thread.join()
try:
    print("Agent: ", end="")
    print(chat_response.json()['textResponse'])
    print("")
except ValueError:
    return "Response is not valid JSON"
except Exception as e:
    return f"Chat request failed. Error: {e}"
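If you do want to track history inside your own application, a small container is enough to start with. This is a hypothetical sketch: the `ChatHistory` class, its cap on stored entries, and the role labels are illustrative choices, not something the repository or AnythingLLM provides.

```python
class ChatHistory:
    """Hypothetical in-app record of the conversation, kept alongside
    whatever history AnythingLLM stores on its side."""

    def __init__(self, max_entries=20):
        self.max_entries = max_entries
        self.entries = []  # list of (role, text) tuples, oldest first

    def add(self, role, text):
        self.entries.append((role, text))
        # Drop the oldest entries once the cap is exceeded.
        self.entries = self.entries[-self.max_entries:]

    def as_transcript(self):
        return "\n".join(f"{role}: {text}" for role, text in self.entries)


history = ChatHistory(max_entries=4)
history.add("You", "What is an NPU?")
history.add("Agent", "A processor designed for neural network workloads.")
print(history.as_transcript())
```

In blocking_chat, the user message could be added before the request and the textResponse value after it.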
Altogether, the blocking_chat function looks like this, and takes the message to be sent as an argument:
    def blocking_chat(self, message):
        global stop_loading
        stop_loading = False
        loading_thread = threading.Thread(target=loading_indicator)
        loading_thread.start()
        data = {
            "message": message,
            "mode": "chat",
            "sessionId": "example-session-id",
            "attachments": []
        }
        chat_response = requests.post(
            self.chat_url,
            headers=self.headers,
            json=data
        )
        stop_loading = True
        loading_thread.join()
        try:
            print("Agent: ", end="")
            print(chat_response.json()['textResponse'])
            print("")
        except ValueError:
            return "Response is not valid JSON"
        except Exception as e:
            return f"Chat request failed. Error: {e}"
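One thing to watch in this pattern: if the request raises before stop_loading is set to True, the indicator thread would keep running. A possible variation, sketched below under the assumption that you are free to restructure the code, replaces the global flag with a `threading.Event` and stops the indicator in a `finally` block. The names `spinner` and `run_with_spinner` are hypothetical and do not appear in the repository.

```python
import sys
import threading
import time


def spinner(stop_event, interval=0.5):
    """Print dots until stop_event is set, then clear the line."""
    dots = 0
    # Event.wait returns True as soon as the event is set, so the loop exits
    # promptly instead of finishing a full sleep first.
    while not stop_event.wait(interval):
        if dots == 10:
            sys.stdout.write("\r" + " " * 10 + "\r")
            dots = 0
        sys.stdout.write(".")
        sys.stdout.flush()
        dots += 1
    sys.stdout.write("\r" + " " * 10 + "\r")
    sys.stdout.flush()


def run_with_spinner(task, *args, interval=0.5):
    """Run task(*args) while a spinner thread animates; always stop the spinner."""
    stop_event = threading.Event()
    thread = threading.Thread(target=spinner, args=(stop_event, interval))
    thread.start()
    try:
        return task(*args)
    finally:
        # Runs on success and on exceptions, so the thread never lingers.
        stop_event.set()
        thread.join()


def slow_task(seconds):
    time.sleep(seconds)  # stand-in for the blocking request
    return "done"


print(run_with_spinner(slow_task, 0.3, interval=0.1))
```

In blocking_chat, the call to requests.post would take the place of slow_task.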
Now let’s look at how streaming compares. It consists of two functions: a synchronous wrapper, streaming_chat, and an async function, streaming_chat_async, that handles the chunks as they arrive. The wrapper simply uses asyncio to run the async function to completion. It is a single line in addition to the definition line and looks like this:
    def streaming_chat(self, message):
        asyncio.run(self.streaming_chat_async(message))
The main function called by the wrapper begins with a couple of variable definitions. The data variable remains the same. The buffer below it is used to collect the chunks as they arrive.
data = {
    "message": message,
    "mode": "chat",
    "sessionId": "example-session-id",
    "attachments": []
}
buffer = ""
The next section is long, so it is split into smaller pieces, starting with some nested contexts and loops:
try:
    async with httpx.AsyncClient(timeout=self.stream_timeout) as client:
        async with client.stream("POST", self.chat_url, headers=self.headers, json=data) as response:
            print("Agent: ", end="")
            async for chunk in response.aiter_text():
                if chunk:
                    buffer += chunk
The first context sets up an asynchronous HTTP client to handle the stream, and the second uses that client to make the actual POST request, sending the message to the model endpoint. An error check, not included here, catches any error raised by the httpx client. The print statement following these contexts prints the initial label to which the chatbot response will be appended.
Directly following are a couple of lines that handle chunks from the response. Like the request itself, the chunks are iterated asynchronously, which yields control back to the event loop whenever a chunk is not yet ready to be processed. As chunks arrive, they are appended to the buffer.
The next couple lines are used to begin processing the data in the buffer:
while "\n" in buffer:
    line, buffer = buffer.split("\n", 1)
    if line.startswith("data: "):
        line = line[len("data: "):]
While the buffer still contains at least one full line, the processing loop runs and prints that line's content to the screen. If no full line is detected (in this case by looking for the newline character), the loop exits and waits for more chunks.
To process a line, it is split off from the buffer, and the buffer is updated to hold only the remainder. From there, the line is cleaned: each streamed line arrives with a data: prefix in front of its JSON payload, so the prefix is stripped before the payload is parsed.
Next, the extracted line is processed:
try:
    parsed_chunk = json.loads(line.strip())
    print(parsed_chunk.get("textResponse", ""), end="", flush=True)
    if parsed_chunk.get("close", False):
        print("")
except json.JSONDecodeError:
    # The line is not a complete JSON; wait for more data.
    continue
except Exception as e:
    # generic error handling, quit for debug
    print(f"Error processing chunk: {e}")
    sys.exit()
Within the try block, the parsed chunk is loaded as json and printed to the console. The last section checks to see if this is the final message; if so, a new line is printed to provide separation between the user's input and the chatbot’s output. Some light error processing – one for json decoding and one generic catchall – handles any unexpected issues.
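The buffering logic can be tried out without a model server. The sketch below pulls the line-splitting and parsing steps into a standalone function and feeds it simulated chunks. `drain_buffer` is a hypothetical name, and the chunk contents are made-up sample data that only imitate the data: prefix and textResponse key used in the code above.

```python
import json


def drain_buffer(buffer):
    """Split complete lines off the buffer and decode their JSON payloads.

    Returns (pieces, remainder): the textResponse strings found in complete
    lines, and whatever partial line is still waiting for more data.
    """
    pieces = []
    while "\n" in buffer:
        line, buffer = buffer.split("\n", 1)
        if line.startswith("data: "):
            line = line[len("data: "):]
        try:
            parsed = json.loads(line.strip())
        except json.JSONDecodeError:
            continue  # blank separator lines and non-JSON lines are skipped
        if isinstance(parsed, dict):
            pieces.append(parsed.get("textResponse", ""))
    return pieces, buffer


# Simulated chunks (made-up data) that split JSON lines across reads.
chunks = [
    'data: {"textResponse": "Hel',
    'lo"}\n\ndata: {"textRes',
    'ponse": " world", "close": true}\n',
]
buffer = ""
reply = ""
for chunk in chunks:
    pieces, buffer = drain_buffer(buffer + chunk)
    reply += "".join(pieces)
print(reply)
```

Because a line is only parsed once its newline has arrived, a JSON object that is split across two chunks simply stays in the buffer until the rest shows up.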
Altogether, the streaming chat functions look like this, and take the message to be sent as an argument:
    def streaming_chat(self, message):
        asyncio.run(self.streaming_chat_async(message))

    async def streaming_chat_async(self, message):
        data = {
            "message": message,
            "mode": "chat",
            "sessionId": "example-session-id",
            "attachments": []
        }
        buffer = ""
        try:
            async with httpx.AsyncClient(timeout=self.stream_timeout) as client:
                async with client.stream("POST", self.chat_url, headers=self.headers, json=data) as response:
                    print("Agent: ", end="")
                    async for chunk in response.aiter_text():
                        if chunk:
                            buffer += chunk
                            while "\n" in buffer:
                                line, buffer = buffer.split("\n", 1)
                                if line.startswith("data: "):
                                    line = line[len("data: "):]
                                try:
                                    parsed_chunk = json.loads(line.strip())
                                    print(parsed_chunk.get("textResponse", ""), end="", flush=True)
                                    if parsed_chunk.get("close", False):
                                        print("")
                                except json.JSONDecodeError:
                                    # The line is not a complete JSON; wait for more data.
                                    continue
                                except Exception as e:
                                    # generic error handling, quit for debug
                                    print(f"Error processing chunk: {e}")
                                    sys.exit()
        except httpx.RequestError as e:
            print(f"Streaming chat request failed. Error: {e}")
Finally, the chatbot is instantiated and called with a main function:
if __name__ == '__main__':
    stop_loading = False
    chatbot = Chatbot()
    chatbot.run()
The first line of this block tells the Python interpreter to run this code if the file is called as the main file. The second sets a global variable used to tell the loading indicator thread whether it should be displaying the loading bar. The last two initialize the Chatbot class and run it.
For more comprehensive, step-by-step instructions, follow the "From Zero to Chatbot: 30-Minute Build Challenge" build-along video.
Test your Chat App
As described in the previous section, you have the option to use a terminal or Gradio chat interface to talk with the bot. After completing setup, run the app you choose from the command line:
# terminal
python src/terminal_chatbot.py
# gradio
python src/gradio_chatbot.py
Troubleshooting Common Problems
AnythingLLM NPU Runtime Missing
On a machine with Snapdragon X Elite, AnythingLLM NPU should be the default LLM provider. If you do not see it in the dropdown, you most likely downloaded the x64 version of AnythingLLM. Delete the app and install the ARM64 version instead.
Model Not Downloaded
Sometimes the selected model fails to download, causing an error in the generation. To resolve, check the model in Settings -> AI Providers -> LLM in AnythingLLM. You should see "uninstall" on the model card if it is installed correctly. If you see "model requires download," choose another model, click save, switch back, then save. You should see the model download in the upper right corner of the AnythingLLM window.
Optional Add-on: Create a Loading Bar
To implement a multi-threaded loading bar for inference tasks in AnythingLLM, you’ll need to use Python's threading and queue modules.
1. Core Components
- Main Thread: Handles UI updates and progress bar rendering
- Worker Thread: Executes the inference task
- Progress Queue: Thread-safe communication channel between threads
import threading
import queue
import time
from alive_progress import alive_bar # Optional for advanced animations
2. Thread-Safe Progress Tracking
def inference_task(progress_queue, query):
    # Simulate inference processing
    steps = ["Tokenizing", "Processing", "Generating"]
    for i, step in enumerate(steps, 1):
        time.sleep(1)  # Replace with actual inference work
        progress_queue.put((i/len(steps)*100, step))
    progress_queue.put((100, "Complete"))
3. Loading Bar Thread
def loading_bar(progress_queue):
    with alive_bar(100, title='Processing Query') as bar:
        while True:
            try:
                progress, status = progress_queue.get(timeout=0.1)
                bar.title(f'[{status}]')
                bar(progress - bar.current)  # Increment by delta
                if progress >= 100:
                    break
            except queue.Empty:
                continue
4. AnythingLLM Integration
def run_inference_with_progress(query):
    progress_queue = queue.Queue()
    # Start inference thread
    inference_thread = threading.Thread(
        target=inference_task,
        args=(progress_queue, query)
    )
    inference_thread.start()
    # Start progress bar in main thread
    loading_bar(progress_queue)
    inference_thread.join()
    print("\nInference complete")
Implementation Notes
Thread Synchronization
- Use queue.Queue for safe inter-thread communication
- Main thread polls queue every 100ms for updates
- Worker sends percentage complete and status messages
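The same worker-and-queue pattern can be exercised with only the standard library, which is handy for checking the threading logic before adding a progress bar package. This is a sketch under stated assumptions: `worker`, `collect_progress`, and the `DONE` sentinel are hypothetical names, and the sleep stands in for real inference work. A sentinel object is used to mark completion so the loop does not depend on a percentage threshold.

```python
import queue
import threading
import time

DONE = object()  # sentinel that marks the end of the worker's updates


def worker(progress_queue, steps):
    """Stand-in for inference: report each step, then signal completion."""
    for index, step in enumerate(steps, 1):
        time.sleep(0.01)  # replace with real work
        progress_queue.put((round(index / len(steps) * 100), step))
    progress_queue.put(DONE)


def collect_progress(steps, poll_seconds=0.1):
    """Run the worker in a thread and gather its updates in the calling thread."""
    progress_queue = queue.Queue()
    thread = threading.Thread(target=worker, args=(progress_queue, steps))
    thread.start()
    updates = []
    while True:
        try:
            item = progress_queue.get(timeout=poll_seconds)
        except queue.Empty:
            continue  # nothing new yet; a UI could redraw here
        if item is DONE:
            break
        updates.append(item)
    thread.join()
    return updates


for percent, status in collect_progress(["Tokenizing", "Processing", "Generating"]):
    print(f"{percent:3d}% {status}")
```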
UI Integration
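# For web UI integration (Flask example)
# Assumes run_inference_with_progress from the previous step is defined
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/chat', methods=['POST'])
def chat():
    query = request.json['query']
    # Run inference in the background so the request returns immediately
    thread = threading.Thread(target=run_inference_with_progress, args=(query,))
    thread.start()
    return jsonify({"status": "Processing started"})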
Advanced Features
- Use alive-progress for animated spinners and throughput stats
- Add ETA calculations:
bar.title(f'ETA: {bar.eta}s | {status}')
Architecture Diagram
[User Input]
     │
     ▼
[Main Thread] ─── starts ───▶ [Worker Thread]
     │   ▲                          │
     │   └──── progress updates ────┘
     ▼
[Loading Bar Render]
     │
     ▼
[LLM Response]
This pattern maintains UI responsiveness while showing real-time inference progress.
Looking for another extension of this basic project? Try using text stream for asynchronous communication!
Additional Resources
Are you new to programming? Welcome! Learning how to code will help give you useful skills that transfer across a lot of different domains.
Coding teaches you how to break down problems into simpler tasks, and how to identify recurring elements of those tasks—often these are things that can be automated, or, at a minimum, you can reuse code you write to execute that task rather than starting from scratch each time.
There are lots of great programming courses for beginners that you might find helpful:
- LearnPython.org has free tutorials, including beginner and advanced options.
- If you learn better when you have a specific problem to solve, you might enjoy Udemy’s project-based offerings. These aren’t free, but Udemy has frequent sales, and you can often find coupon codes.
- If you’re an absolute beginner, Free Code Camp’s free Python tutorial will get you up and running with the basics in about 4.5 hours.
We’d love to see what you built! Join us on Qualcomm Developer Discord to show off your project! Or, check out Qualcomm AI Hub for more information about using AI models and tools on devices with Qualcomm technology.
FAQ
Q: What is Edge AI and why use Snapdragon NPU?
A: Edge AI refers to running machine learning models directly on local devices (like hardware powered by Snapdragon), reducing latency, minimizing data transfer, and improving privacy.
The Snapdragon NPU accelerates on-device inference, making chat and AI apps run faster and more responsively without relying on the cloud.
Q: What is AnythingLLM and how does it integrate with Snapdragon?
A: AnythingLLM is a lightweight, provider-agnostic framework for running large language models. In this walkthrough, we choose “AnythingLLM NPU” (which may be listed as “Qualcomm QNN”) as the LLM provider for NPU acceleration. This enables local inference with models like Llama 3.1 8B using Snapdragon’s accelerator hardware.
Q: How can I improve UI responsiveness during inference?
A: Implement asynchronous inference with a worker thread and communicate progress via a queue.Queue to the main thread. Using tools like alive-progress, you can render a loading bar (or Gradio UI) while the NPU processes data in the background, keeping the interface fluid.

