On-Device Agentic AI Workflows with Qualcomm Hexagon NPU and LLMWare.ai
Abstract:
Most enterprise AI today is designed around cloud-hosted models and chat-based interactions, which are inherently reactive and poorly suited for real-world business processes that are repetitive, multi-step, and schedule-driven.
These workflows increasingly require execution closer to the data, where latency, privacy, and operational efficiency are critical. However, existing approaches struggle to automate end-to-end workflows on-device, limiting their ability to deliver continuous, proactive value beyond isolated prompts.
This blog introduces a new paradigm for on-device agentic AI, combining LLMWare’s Model HQ with Qualcomm Hexagon NPUs on PCs with Snapdragon X Series processors to enable automated, multi-step workflows powered by Small Language Models (SLMs) running entirely locally.
Using a no-code, drag-and-drop interface, developers can orchestrate agents that ingest enterprise data (e.g., Jira), apply reasoning, generate insights, and trigger actions on a schedule—without cloud dependency.
While products like Claude Cowork and OpenClaw provide autonomous AI assistants for individual users, Model HQ gives teams a private AI workflow building and deployment system that enables reusable automation across shared business operations. The result is fast, private, local and cost-efficient AI automation, unlocking scalable deployment of thousands of workflows that transform enterprise productivity and demonstrate measurable ROI and token cost savings through real-world use cases.
Solution Architecture:
Application Layer (User / Workflow Interface) provides
- No-code integration to Microsoft Foundry Local and other model repositories
- No-code agent builder and workflow UI
- Custom agents, templates, batch processing, and API integrations
Example: Jira ingestion → prioritization → summarization → CSV + notifications
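The example flow above can be sketched in a few lines of Python. This is only an illustration of the shape of the pipeline, not Model HQ's implementation: the `Ticket` fields, the sample data, the `prioritize` rule, and the `summarize` stub (which stands in for a call to a local SLM) are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    key: str
    priority: str      # assumed values: "High", "Medium", "Low"
    description: str


def ingest():
    """Stand-in for the Jira ingestion step; returns hard-coded sample tickets."""
    return [
        Ticket("OPS-1", "High", "Login service times out under load. Users see 504 errors."),
        Ticket("OPS-2", "Low", "Typo on the settings page."),
        Ticket("OPS-3", "High", "Nightly backup job fails. Disk quota exceeded on the backup host."),
    ]


def prioritize(tickets, keep=("High",)):
    """Keep only tickets whose priority is in the `keep` set."""
    return [t for t in tickets if t.priority in keep]


def summarize(ticket):
    """Placeholder for the SLM call: keeps only the first sentence."""
    return ticket.description.split(". ")[0].rstrip(".") + "."


def run_pipeline():
    """Ingest -> prioritize -> summarize, returning one row per kept ticket."""
    return [
        {"key": t.key, "priority": t.priority, "summary": summarize(t)}
        for t in prioritize(ingest())
    ]


rows = run_pipeline()
for row in rows:
    print(row)
```

In a real deployment the `summarize` step would be replaced by the model node, and the resulting rows would feed the CSV and notification steps.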
Model & Knowledge Layer allows
- Ingesting SLMs and local LLMs optimized for Snapdragon NPU
- Integrated AI knowledge management for:
  - Document parsing, search, and vector databases
  - Prompt management and generation pipelines
  - RAG-style workflows and structured processing
Agent Orchestration Layer provides
- Agent-based process orchestration engine
- Multi-step workflow chaining (data → model → logic → action)
- Scheduling for hands-free, time-based execution
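Time-based execution comes down to computing when a workflow should fire next. The sketch below shows one way to do that for a daily run time that skips weekends. The function name `next_run` and the weekday-only rule are assumptions for illustration; the article does not describe how Model HQ's scheduler is implemented.

```python
from datetime import datetime, time, timedelta


def next_run(now, run_at, weekdays_only=True):
    """Return the next datetime strictly after `now` at which a daily job should fire."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed, so move to tomorrow.
        candidate += timedelta(days=1)
    # weekday() is 0 for Monday through 6 for Sunday; 5 and 6 are the weekend.
    while weekdays_only and candidate.weekday() >= 5:
        candidate += timedelta(days=1)
    return candidate


# Friday 2025-01-03 at 09:30 with a daily 08:00 slot rolls over to Monday.
print(next_run(datetime(2025, 1, 3, 9, 30), time(8, 0)))
```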
Runtime & Inference Layer for flexibility
- Windows ML + ONNX Runtime / GenAI stack
- Qualcomm AI Engine with Qualcomm AI Engine Direct SDK execution providers
- Heterogeneous compute orchestration across:
  - NPU (primary for AI inference)
  - GPU and CPU fallback paths
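The NPU-first, GPU/CPU-fallback idea can be illustrated with a small selection function. ONNX Runtime treats the provider list passed to a session as a priority order, and `onnxruntime.get_available_providers()` reports what is installed; the sketch below takes that list as a plain argument so it runs without ONNX Runtime present. The preference order shown is an assumption, and how Model HQ actually selects providers is not described in this article.

```python
# Real ONNX Runtime provider names; the ordering is an assumed NPU -> GPU -> CPU preference.
PREFERRED_ORDER = [
    "QNNExecutionProvider",   # Qualcomm AI Engine Direct (NPU)
    "DmlExecutionProvider",   # DirectML (GPU)
    "CPUExecutionProvider",   # always-available fallback in standard builds
]


def pick_providers(available, preferred=PREFERRED_ORDER):
    """Return the preferred providers that are actually available, in priority order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError("no usable execution provider found")
    return chosen


# On a machine with ONNX Runtime installed, `available` could come from
# onnxruntime.get_available_providers(); here it is hard-coded.
providers = pick_providers(["CPUExecutionProvider", "QNNExecutionProvider"])
print(providers)
```

The resulting list could then be passed as the `providers` argument when creating an `onnxruntime.InferenceSession`.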
Hardware Layer for edge execution on
- Snapdragon X Series platform with on-device NPU acceleration for:
  - Low latency inference
  - Performance per watt optimized for productivity and speed
  - Fully offline / private execution
Build-along time:
Prerequisites:
- Set up LLMWare’s Model HQ for PCs with Snapdragon (see “LLMWare AI for Complex Enterprises”)
- Install Microsoft Foundry Local (see “Get started with Foundry Local” on Microsoft Learn)
- Configure your enterprise email account to receive notifications
- Obtain your Jira integration API token
LLMWare’s Model HQ offers an intuitive, no-code UI to build innumerable workflows that are relevant for the enterprise. Let’s explore one such workflow further.
Use case:
- Automate Jira ticket overload into a daily insight pipeline using LLMWare’s Model HQ no-code agent platform running locally on Snapdragon X2 Elite
- Connect to Jira to filter priority issues and generate summaries using an NPU-optimized model through a simple drag-and-drop workflow
- Establish scheduled runs to produce structured reports and send notifications, delivering timely, actionable insights to stakeholders
Step 1: Integrate Foundry Local into Model HQ and download an SLM
In just a few clicks, integrate Foundry Local and download the model of your choice through Model HQ.
Step 2: Integrate your enterprise Jira instance
Connecting to your enterprise Jira instance takes a single step: enter your credentials.
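For readers curious what those credentials turn into, Jira Cloud's REST API accepts HTTP Basic authentication built from an account email and an API token. The sketch below constructs that header. Model HQ handles this through its UI, and whether it uses exactly this scheme internally is an assumption; the helper name and the sample values are hypothetical.

```python
import base64


def jira_auth_header(email, api_token):
    """Build HTTP Basic auth headers from an email and API token pair."""
    raw = f"{email}:{api_token}".encode("utf-8")
    encoded = base64.b64encode(raw).decode("ascii")
    return {
        "Authorization": "Basic " + encoded,
        "Accept": "application/json",
    }


# Placeholder credentials for illustration only; never hard-code a real token.
headers = jira_auth_header("user@example.com", "example-token")
print(headers["Authorization"][:6])
```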
Step 3: Create, edit and test custom services (to later integrate into the agent process)
Set up a custom service that connects to the Jira instance as the knowledge base: define the service, link it to a project, choose the output format, and optionally update the JSON configuration.
You can build a new service, or load and edit an existing one.
Step 4: Create an agent pipeline
The Model HQ visual UI builder provides an intuitive way to create agentic pipelines—simply drag and drop nodes to connect tasks, configure inputs/outputs, select the model in Agent Global configurations, and your workflow is ready.
Test pipeline execution.
When the pipeline runs, the model is offloaded to the Snapdragon NPU for execution.
The CSV output file now contains rows processed and filtered for relevance through the agentic pipeline.
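As a rough picture of what such an output step produces, the sketch below serializes a few filtered rows to CSV text with Python's standard `csv` module. The column names and sample rows are assumptions; the actual columns depend on how the service and pipeline are configured.

```python
import csv
import io


def rows_to_csv(rows, fieldnames=("key", "priority", "summary")):
    """Serialize a list of dicts to CSV text, ignoring any extra keys."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical rows, as they might look after filtering and summarization.
sample_rows = [
    {"key": "OPS-1", "priority": "High", "summary": "Login service times out, users see 504s."},
    {"key": "OPS-3", "priority": "High", "summary": "Nightly backup job fails."},
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Fields that contain commas, such as the first summary, are quoted automatically by the writer, so the file stays parseable.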
Step 5: Set up email integration and a scheduler
Through Model HQ, set up a new integration service to configure emails.
Add a node to the pipeline to route the CSV output to an email distribution list.
Then set up an automated scheduler so the emails are generated at the required cadence.
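For a sense of what the email node does conceptually, the following sketch builds a message with the CSV attached using Python's standard `email` package. The addresses, subject line, and file name are placeholders, and the message is only constructed here, not sent; Model HQ's own email integration is configured through the UI and may work differently.

```python
from email.message import EmailMessage


def build_report_email(sender, recipients, csv_text, filename="jira_report.csv"):
    """Create an email with the CSV report attached as text/csv."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = "Daily Jira priority report"
    msg.set_content("The latest filtered Jira report is attached.")
    # Passing a str with subtype="csv" yields a text/csv attachment.
    msg.add_attachment(csv_text, subtype="csv", filename=filename)
    return msg


msg = build_report_email(
    "reports@example.com",
    ["team-leads@example.com", "ops@example.com"],
    "key,priority,summary\r\nOPS-1,High,Login service times out.\r\n",
)
print(msg["To"])
```

Sending would typically go through `smtplib.SMTP(...).send_message(msg)` against your organization's mail server.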
Benefits:
LLMWare’s Model HQ enables enterprises to rapidly integrate Microsoft Foundry Local models into their workflows, providing a foundation for building AI-driven automation. With its no-code agent platform, organizations can deploy fast, private AI workflows that run entirely on-device. Inference is accelerated by the Snapdragon NPU, with Windows ML, ONNX Runtime, and the Qualcomm AI Engine Direct SDK execution providers in the backend.
The architecture also supports extensibility into pro-code environments, giving developers the flexibility to customize and scale solutions as business requirements evolve.
Learn more and give it a try!
What XNNPACK does (and doesn’t) do
When people first hear about XNNPACK, it’s easy to confuse it with a full graph compiler. It isn’t: XNNPACK lives below the graph level. It does not:
- Perform cross-op fusion (e.g., Conv+BN+ReLU, long elementwise chains).
- Do graph rewrites like constant folding or inserting quantize/dequantize nodes.
- Decide op ordering, device placement, or global scheduling.
- Choose quantization strategy or layouts on its own.
All of that is the job of the ML framework (TFLite, PyTorch, ONNX Runtime, etc.). The framework partitions the graph and then hands chunks of it to XNNPACK as “subgraphs” to execute.
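The partitioning step can be pictured with a deliberately simplified sketch: walk the ops in order and group consecutive supported ones into runs that would be handed off together. Real frameworks partition a DAG rather than a flat list, and the `SUPPORTED` set and op names below are illustrative assumptions, not XNNPACK's actual operator list.

```python
# Illustrative subset only; not the real list of operators XNNPACK supports.
SUPPORTED = {"Conv2D", "FullyConnected", "Add", "Softmax"}


def partition(ops, supported=SUPPORTED):
    """Group consecutive ops into (delegated, [ops]) runs."""
    parts = []
    for op in ops:
        delegated = op in supported
        if parts and parts[-1][0] == delegated:
            # Same kind as the previous op: extend the current run.
            parts[-1][1].append(op)
        else:
            # Support status changed: start a new run.
            parts.append((delegated, [op]))
    return parts


graph = ["Conv2D", "Add", "CustomNMS", "FullyConnected", "Softmax"]
parts = partition(graph)
for delegated, run in parts:
    print("delegate" if delegated else "framework", run)
```

Each delegated run corresponds to one subgraph passed down; the unsupported op in the middle stays with the framework's own kernels.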
How XNNPACK works
XNNPACK focuses on turning those subgraphs into fast kernel calls by providing a subgraph API to build a DAG of operators and lowering each subgraph into calls to optimized microkernels for the chosen architecture.
It also provides packing and layout transformations for weights and activations.
Along the way, it performs operator-specific optimizations, such as packing weights (reordering and interleaving for GEMM/conv), selecting the best kernel variant (tile sizes, MR/NR, data type), and using fused operators when available (e.g., SoftmaxFused instead of a chain of ReduceMax → Sub → Exp → ReduceSum → Div).
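To make weight packing concrete, here is a simplified sketch that reorders a K x N row-major weight matrix into panels of NR columns, zero-padding the last panel, so that a microkernel can read each panel as one contiguous stream. XNNPACK's real packing routines are more involved (for example, they can interleave biases and depend on the chosen kernel); the function below is an assumption-laden illustration of the idea only.

```python
def pack_weights(b, nr):
    """Pack a K x N row-major matrix into contiguous panels of `nr` columns.

    Within each panel, values are stored row by row (k-major), and the last
    panel is zero-padded when N is not a multiple of nr.
    """
    k_dim, n_dim = len(b), len(b[0])
    packed = []
    for j0 in range(0, n_dim, nr):          # one panel per group of nr columns
        for k in range(k_dim):              # walk down the reduction dimension
            for j in range(j0, j0 + nr):    # nr adjacent columns for this k
                packed.append(b[k][j] if j < n_dim else 0)
    return packed


weights = [
    [1, 2, 3],
    [4, 5, 6],
]
packed = pack_weights(weights, nr=2)
print(packed)
```

Because the packing is done once, ahead of inference, its cost is amortized over every subsequent call to the kernel.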