This post was written in collaboration with the AI Services Team of SECO S.p.A.
- Running edge AI is a heavy lift for most standard hardware.
- SECO has developed an edge AI assistant that runs an LLM with RAG on a dedicated system-on-module (SoM) powered by the Qualcomm Dragonwing QCS6490.
- All of the work involved in the LLM and RAG runs on the SoM, eliminating the latency of shuttling data to and from the cloud and improving security and privacy.
How can you get the most out of a large language model (LLM) and retrieval-augmented generation (RAG) on an edge AI device? When you run AI at the edge, your resources are constrained: limited memory, tight network bandwidth and a low power budget, with no cloud or data center to fall back on. Even when your model is optimized for edge inferencing, it’s still a heavy lift for the hardware.
That’s why SECO has developed an edge AI assistant that uses an LLM with RAG and runs entirely on the Qualcomm Dragonwing QCS6490 system-on-chip (SoC). The solution is compact enough to operate on a single system-on-module (SoM). At Embedded World 2025, SECO demonstrated the AI assistant running on their SoM.
The Dragonwing QCS6490 processor is purpose-built for high-performance edge computing. It combines a Qualcomm Kryo CPU with up to 8 cores, an integrated Qualcomm Adreno GPU and a powerful AI engine (NPU + DSP) that delivers up to 12 TOPS. The platform enables real-time, on-device processing for compute-intensive applications like autonomous robotics, smart vision and industrial automation.
Building an AI assistant without the cloud
Amid the growing focus on deploying AI agents and LLMs at the edge, SECO has found a market opportunity in testing the computational efficiency of hardware. They stress-test hardware to explore its physical boundaries while providing multiple solutions to customer problems.
They designed and built the SECO SOM-SMARC-QCS6490, a high-performance edge AI computing module, around the Dragonwing QCS6490 processor. As a test case for AI and the Internet of Things (IoT), they developed the AI assistant to perform on-device inference and real-time response generation with no reliance on cloud-based services.
The assistant uses a 3-billion-parameter Llama (Llama 3B) model with RAG to deliver precise, contextually relevant responses from a structured knowledge base. It processes user input in two stages:
1. Pre-processing and query optimization
- Cleans, reformulates and synthesizes user input, optimizing the query for document retrieval
- Enhances the efficiency of database searches and aligns the request with product documentation
2. Contextual document retrieval and response generation
- Searches and retrieves relevant documents from the dataset
- Passes those documents as additional context along with the user’s query
- Generates a final response based on both the query and the retrieved documents, to ensure greater accuracy and relevance
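The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not SECO's implementation: the knowledge base contents, the stopword list, the bag-of-words cosine retriever and every function name are assumptions made for the example. In particular, stage 1 here is a simple keyword filter standing in for whatever reformulation step the real assistant uses, and `generate` is a placeholder for a call into the on-device LLM.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for an on-device knowledge base.
KNOWLEDGE_BASE = [
    "The SOM-SMARC-QCS6490 module is built around the Dragonwing QCS6490 processor.",
    "Clea OS is a custom Yocto-based operating system from SECO.",
    "The AI assistant runs a 3-billion-parameter Llama model with RAG on the module.",
]

# Assumed stopword list; a real system would use something more complete.
STOPWORDS = {"the", "a", "an", "is", "of", "on", "what", "which", "please",
             "tell", "me", "about", "does", "do", "it", "and", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def preprocess_query(raw_query):
    """Stage 1: clean the input and keep only content-bearing terms."""
    return [t for t in tokenize(raw_query) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query_terms, docs, top_k=1):
    """Stage 2a: rank documents against the query and keep the best matches."""
    query_vec = Counter(query_terms)
    scored = [(cosine(query_vec, Counter(tokenize(d))), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(user_query, context_docs):
    """Stage 2b: pass the retrieved documents as context with the query."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:")


def answer(user_query, generate):
    """Run both stages; `generate` is any callable mapping prompt -> text."""
    terms = preprocess_query(user_query)
    docs = retrieve(terms, KNOWLEDGE_BASE)
    return generate(build_prompt(user_query, docs))
```

Calling `answer("What is Clea OS?", generate=my_llm)` would pass the Clea OS entry to the model as context, where `my_llm` is whatever function wraps the local model.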
Moving a chat application and RAG from the cloud to an edge device
The AI assistant test case involved moving two workloads to the edge device: a chat application based on a Llama model and a RAG pipeline. Typically run in the cloud with ample resources, those workloads need to be optimized to work well on edge devices.
To develop the test case for the SoM, SECO engineers used:
- Their own Clea OS, a custom Yocto OS
- Clea AI Studio, a framework for developing and testing algorithms
- The open-source Llama.cpp library, which is engineered to run AI models such as LLMs directly on systems-on-chip (SoCs) like the Dragonwing QCS6490 processor and to take advantage of local processing capabilities
- The Llama 3B LLM
- The Qualcomm AI Hub to optimize performance of the Llama model on the hardware
- The Qualcomm AI Engine Direct SDK for distributing workloads among hardware cores (CPU/GPU/NPU)
The engineers also optimized the retrieval and generation pipeline. Their tip to developers is to invest time in building a well-structured knowledge base and refining retrieval models to improve accuracy.
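One common way to structure a knowledge base for retrieval is to split each document into small, overlapping chunks that carry metadata pointing back to their source. The post does not say how SECO structured theirs, so the sketch below is only an assumed approach; the function name, chunk sizes and dictionary fields are hypothetical.

```python
def chunk_document(doc_id, text, max_words=50, overlap=10):
    """Split text into overlapping word windows tagged with their origin.

    Overlap keeps a sentence that straddles a boundary retrievable
    from either neighboring chunk.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + max_words]
        chunks.append({
            "doc_id": doc_id,        # which source document this came from
            "start_word": start,     # offset, useful for citing or debugging
            "text": " ".join(piece),
        })
        if start + max_words >= len(words):
            break  # the last window already reached the end of the document
    return chunks
```

Smaller chunks keep the prompt short, which matters on a device with limited memory, while the overlap reduces the chance that a relevant passage is cut in half.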
The development and testing of the AI assistant showcased at Embedded World 2025 involved SECO’s AI Services team. Other team members contributed on specific technical, integration and coordination tasks, and the team benefited from technical support provided by Qualcomm Technologies, Inc.
Results
SECO’s priority in working with Llama.cpp on the edge device was to maximize the number of tokens generated per second. As a result of the implementation, the two workloads described above generate 10.3 tokens per second on average on the SECO SOM-SMARC-QCS6490.
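To put the throughput figure in context, the sketch below shows one way to measure tokens per second over any token stream and to estimate how long a response of a given length would take at the reported average. The function names are hypothetical, and the post does not describe SECO's measurement method; the estimate also ignores prompt-processing time, which is usually measured separately.

```python
import time


def tokens_per_second(token_stream, clock=time.perf_counter):
    """Consume an iterable of tokens and return the generation rate."""
    start = clock()
    count = 0
    for _ in token_stream:
        count += 1
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else float("inf")


def estimated_response_seconds(num_tokens, rate=10.3):
    """Estimate generation time at a given rate (default: the reported average)."""
    return num_tokens / rate
```

At 10.3 tokens per second, a 103-token answer takes about 10 seconds to generate, so keeping responses concise has a direct effect on perceived latency.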
The first workload is the chat application itself. Users interact with the LLM directly on the device through a chat interface. They can ask general knowledge questions based on the LLM’s training data, which eliminates the need for cloud-based processing for these basic interactions. The second, more advanced workload is RAG. It augments the LLM’s capabilities by retrieving post-training information from local data sources. Those sources, such as databases and document collections, reside on the device.
Without relying on the cloud, all the work involved in the LLM and RAG runs on the SoM. The solution eliminates the latency of shuttling data to and from the cloud and improves the security and privacy profile for data stored locally.
Next steps
SECO’s goal is to offer system integrators and OEMs the flexibility to build future-ready edge solutions while maintaining control over power consumption, form factor and deployment costs. They foresee a growing role on edge devices for applications like predictive maintenance, computer vision and local LLMs. They believe that Qualcomm Technologies’ evolving product portfolio is well suited to that role.
They rely on the Dragonwing QCS6490 processor for its powerful yet efficient AI performance in verticals such as industrial automation, medical devices and IoT systems. They are also evaluating the chipset for their Modular Vision HMI family, designed for powerful edge AI applications.
For even greater efficiency, they are developing a SMARC module based on the Dragonwing QCS5430 processor. The chipset, which SECO values for its flexibility, allows for in-the-field software upgrades to CPU performance.
And for premium-tier devices, they have launched a new project for a module built on Snapdragon X processors, offering even more AI accelerator performance with up to 45 TOPS. The COM Express module is ideal for edge-based AI and generative AI tasks while staying within the power and thermal envelopes typical in edge deployments.
Discover more about SECO at seco.com.

