Build a simple RAG system using the Qualcomm AI Inference Suite

Although a lot has been written about Retrieval-Augmented Generation (RAG) as a valuable pattern for using AI, it still looks like magic to the average business user and remains a bit unclear to technical implementers who haven’t used it before. In this blog post, we’ll walk through a super simple setup that shows how RAG works, using the Qualcomm AI Inference Suite for all the AI parts.

Setup

Conceptually, RAG lets you use AI to query a defined set of documents and get an answer drawn from those documents, rather than having the AI blather on about something outside the specific topic of interest.

When you issue a query to a RAG system, the steps it takes are:

  1. Turn the user’s query into a ‘query embedding.’
  2. Compare the user’s query embedding to an index of previously computed embeddings for all target documents which may contain the ‘answer.’
  3. Retrieve some number of those documents or document fragments. This parameter is often denoted as the ‘top k documents.’
  4. Feed an LLM the user’s query with instructions to answer the question using the context of the data retrieved in the previous step.
  5. Optionally, specify that the LLM should answer in some standard way if the query can’t be answered with the given data. If the AI were a human, it would say something like, “I don’t have enough information to answer” rather than coming up with a wrong answer, aka ‘hallucination.’
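
The steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions, not the Qualcomm AI Inference Suite API: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, the documents are made up, and the final LLM call is left out, so the sketch stops once the prompt is built.

```python
import math
from collections import Counter

def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    words = text.lower().replace("?", "").replace(".", "").split()
    counts = Counter(words)
    return [float(counts[word]) for word in vocab]

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_top_k(query_vec, doc_vecs, k):
    """Return the positions of the k document vectors closest to the query."""
    ranked = sorted(range(len(doc_vecs)), key=lambda i: l2_distance(query_vec, doc_vecs[i]))
    return ranked[:k]

docs = [
    "The cafeteria opens at eight.",
    "Parking permits are issued by the front desk.",
    "The library closes at nine.",
]
vocab = sorted({w for d in docs for w in d.lower().replace(".", "").split()})

# Index time: embed every document once.
doc_vecs = [toy_embed(d, vocab) for d in docs]

# Steps 1-3: embed the query, compare it with the index, keep the top k.
query = "When does the library close?"
top = retrieve_top_k(toy_embed(query, vocab), doc_vecs, k=1)

# Steps 4-5: hand the LLM the question, the retrieved context, and a fallback instruction.
context = "\n".join(docs[i] for i in top)
messages = [
    {"role": "system", "content": "Answer only from the context. If the context is insufficient, say so."},
    {"role": "user", "content": f"Context: {context}\nQuestion: {query}"},
]
print(messages[1]["content"])
```

A real embedding model captures meaning rather than exact word overlap, but the control flow is the same.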

A diagram is useful to illustrate the steps we need to take.

Figure 1: For documents and/or data generate embeddings, store, and index them. Use the index to compare the embedding of a query and return top k items. Feed query string plus returned top k items to LLM to generate an answer.

One important point that might trip up first-timers is that you need to use the same embedding model, in this case BAAI/bge-large-en-v1.5, for both the initial step of processing your document set and for the embedding of the user’s query. If you don’t, you may still get an answer, but it will likely be wrong, because vectors produced by different models are not comparable. Practically, this means that if you decide to change embedding models, you will need to rebuild your index and modify your query-processing code to match.
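
One way to guard against this mistake is to store the embedding model's name alongside the index and check it at query time. The sketch below is a hypothetical helper, not part of FAISS or the Imagine SDK; the class name `EmbeddingIndex`, the 4-dimensional toy vector, and the name `some-other-model` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class EmbeddingIndex:
    """Toy index that remembers which embedding model produced its vectors."""
    model_name: str
    dimension: int
    vectors: list = field(default_factory=list)

    def add(self, vector):
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")
        self.vectors.append(list(vector))

    def check_query(self, model_name, vector):
        """Refuse queries embedded with a different model or dimension."""
        if model_name != self.model_name:
            raise ValueError(f"index was built with {self.model_name!r}, query used {model_name!r}")
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")

# Toy dimension of 4 for readability; real embedding vectors are far longer.
index_meta = EmbeddingIndex(model_name="BAAI/bge-large-en-v1.5", dimension=4)
index_meta.add([0.1, 0.2, 0.3, 0.4])

try:
    index_meta.check_query("some-other-model", [0.1, 0.2, 0.3, 0.4])
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```

Note that the dimension check alone is not enough: two different models can produce vectors of the same length that still live in unrelated spaces, which is why the model name is checked as well.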


Scenario code walkthrough

Using the Qualcomm AI Inference Suite running on Cirrascale infrastructure (powered by Qualcomm Cloud AI accelerators), I’ll demonstrate how to build a simple RAG using Python in a Jupyter notebook.

The first bits of code are not too exciting and are similar to other blogs where we do the same things: install the Imagine SDK, import libraries, and set up environment variables to hold our API endpoint and API key.

One addition to this RAG scenario is the use of the FAISS library. FAISS (Facebook AI Similarity Search) is an open-source library developed by Meta AI that enables efficient similarity search and clustering of data in the form of dense vectors. 

More simply stated, we need to be able to search the data inside our documents for similarity to a user’s query.  By identifying that data, we now have context to provide an LLM with so that it can answer based on your data rather than what it was trained on.
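
To make "similarity search" concrete, here is a minimal pure-Python sketch of an exhaustive nearest-neighbor search by squared L2 distance, which is the kind of comparison a flat L2 index performs. It is not FAISS code: `flat_l2_search` is a hypothetical helper and the vectors are made up.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Compare the query with every stored vector; return (distances, indices) of the k closest."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

stored = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
distances, indices = flat_l2_search(stored, [0.8, 0.2, 0.0], k=2)
print(indices)    # positions of the two closest stored vectors, nearest first
print(distances)  # their squared L2 distances, smallest first
```

A library such as FAISS does the same job far more efficiently and offers approximate index types for large collections.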

Installation of FAISS is as simple as using pip install to your environment and then importing it:

!pip install faiss-cpu

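# at the top of your Python code
import faiss
import numpy as np          # used below to build float32 arrays for FAISS
from pprint import pprint   # used below to print results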

The document set

In a production environment, the document set that we are using with RAG might be in a filesystem, a database, or some other data store. If there is a lot of data, you need to compute embeddings for each piece of data, splitting it into chunks if it is too large to fit in the chosen embedding model’s context window, and then store the results in a vector database for later retrieval and comparison.
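
As a rough sketch of the chunking step mentioned above, the following splits text into overlapping word-based chunks. It rests on simplifying assumptions: `chunk_words` is a hypothetical helper, and it counts words, whereas a real pipeline would usually count tokens with the embedding model's own tokenizer.

```python
def chunk_words(text, max_words, overlap):
    """Split text into chunks of at most max_words words.

    Each chunk after the first repeats the last `overlap` words of the
    previous chunk so that sentences cut at a boundary keep some context.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

text = "one two three four five six seven eight nine ten"
chunks = chunk_words(text, max_words=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk would then be embedded and indexed as if it were its own small document.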

For the purpose of understanding how RAG functions, all those steps add complexity that we will skip for this sample. We want to show the bare minimum of steps that make RAG work so that you can apply this learning to your own scenario, data sets, and capabilities.

With that in mind, we create a small dataset of strings that represent our document set:

# create some data to test embedding
documents = [
    "Ray works at Qualcomm.",
    "Ray is a specialist in developer relations.",
    "Qualcomm AI Inference Suite is great for AI inference workloads",
]

Generate embeddings

Embeddings are numerical representations of data, such as words, images, or documents, that capture their meaning, context, or relationships as vectors in a high-dimensional space. They allow machines to compare and understand inputs by measuring similarity between these representations.
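
To illustrate what "measuring similarity" means, here is a small sketch that compares hand-made vectors with cosine similarity, one common similarity measure. The three-dimensional vectors are invented for the example and do not come from any embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; real models produce far more dimensions.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
invoice = [0.1, 0.2, 0.95]

sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, invoice)
print(sim_related > sim_unrelated)
```

The walkthrough in this post uses L2 (Euclidean) distance through FAISS instead, where a smaller value means a closer match.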

Next, we need to create embedding representations for our document set above and then create an index which will allow us to search for those documents with the closest similarity to a user query.  This is the key way that RAG provides context to an LLM to answer questions from specific data sets rather than answering from whatever data the LLM was originally trained on.

The following code sets up the embedding model we’ll use to generate the embeddings, does the actual embedding calculations, and creates our index using FAISS.  FAISS is a bit complicated to understand from code alone, so it is worthwhile to read the documentation if you want to really understand what is going on here.

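# check installed models and grab one to use
all_models = client.get_available_models_by_type()
embedding_models = client.get_available_models_by_type(ModelType.EMBEDDING)
# [1] selects an entry by position; confirm with the pprint output below that it
# is the model you intend to use (BAAI/bge-large-en-v1.5 in this post)
use_embed_model = embedding_models.get(ModelType.EMBEDDING, 0)[1]
pprint(use_embed_model)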

# create embeddings and list
doc_embeddings = client.embeddings(documents, model=use_embed_model)
doc_embeddings = doc_embeddings.data
for item in doc_embeddings:
    pprint(item)

# get embeddings into the right format for FAISS use: a 2-D float32 array
doc_embeddings = np.array([item.embedding for item in doc_embeddings], dtype="float32")
pprint(doc_embeddings.shape)  # (number of documents, embedding dimension)

# create FAISS index: IndexFlatL2 does an exact (brute-force) search by L2 distance
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings)

Find relevant data given a user query

To see if we have relevant data from our document set to answer a user query, we create an embedding of the user’s question and then search our index for the closest match. Note that the user query may be about something that isn’t in the data set; the search still returns the nearest documents regardless. This step simply returns the top k pieces of data (here, k=1) as context for an LLM to answer the question.

# query and retrieve relevant document
query = "How can I do AI inference?"
query_embedding = client.embeddings([query], model=use_embed_model).data[0].embedding

# FAISS expects a 2-D float32 array of shape (n_queries, dimension)
query_vector = np.array(query_embedding).astype("float32")
query_vector = query_vector.reshape(1, -1) # Reshape to (1, dimension)
# D holds the (squared L2) distances and I the positions in the index of the k nearest documents
D, I = index.search(query_vector, k=1)

retrieved_doc = documents[I[0][0]]
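
If you raise k above 1, or want to drop matches that are too far from the query, you need a small step that maps the search output back to documents. The sketch below assumes output shaped like the `D` and `I` arrays above (one row per query). `select_context` is a hypothetical helper, the sample values are made up, and the `max_distance` threshold of 0.5 is arbitrary and would need tuning for a real embedding model.

```python
def select_context(distances, indices, documents, max_distance):
    """Return the retrieved documents whose distance is within max_distance.

    distances and indices each hold one row per query, nearest match first.
    """
    selected = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx == -1:
            continue  # FAISS pads with -1 when the index holds fewer than k vectors
        if dist <= max_distance:
            selected.append(documents[idx])
    return selected

sample_docs = ["doc A", "doc B", "doc C"]
# Made-up search output for a single query with k=3.
sample_D = [[0.21, 0.48, 1.35]]
sample_I = [[2, 0, 1]]

context_docs = select_context(sample_D, sample_I, sample_docs, max_distance=0.5)
print(context_docs)
```

An empty result is a useful signal that the question probably falls outside the document set.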

Ask an LLM to answer the user query with context

The final step is to provide an LLM with a system prompt containing guardrails and instructions on what to do if the context doesn’t answer the question.  We also provide the user’s question, and any data retrieved as context.

# Let's call an LLM to have it answer with the provided data
payload = {
    "model": "Llama-3.1-8B", # can try other models as well
    "messages": [
        {"role": "system", "content": "Answer the question using the provided context.  If you can't answer using the provided context, say that the data is not in the document set."},
        {"role": "user", "content": f"Context: {retrieved_doc}\nQuestion: {query}"}
    ]
}
chat_response = client.chat(messages=payload["messages"], model=payload["model"])
pprint(chat_response.first_content)
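
When more than one document is retrieved, the context needs to be assembled before it goes into the prompt. Here is one possible way to do that, reusing the system prompt from the code above. `build_messages` is a hypothetical helper, and the numbered-list format and the empty-context placeholder text are design choices rather than requirements of the API.

```python
SYSTEM_PROMPT = (
    "Answer the question using the provided context.  If you can't answer "
    "using the provided context, say that the data is not in the document set."
)

def build_messages(query, context_docs):
    """Assemble chat messages from the question and zero or more retrieved documents."""
    if context_docs:
        context = "\n".join(f"[{n}] {doc}" for n, doc in enumerate(context_docs, start=1))
    else:
        context = "(no relevant documents found)"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\nQuestion: {query}"},
    ]

msgs = build_messages(
    "Where does Ray work?",
    ["Ray works at Qualcomm.", "Ray is a specialist in developer relations."],
)
print(msgs[1]["content"])
```

The resulting list can be passed as the `messages` argument in the same way as `payload["messages"]` above.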

Try it out yourself

Using the sample code in a Jupyter notebook, you can give it a go yourself. Try changing the data set to include different information, making it as long as you like. Try changing the query to both things you know are in the data set and things that aren’t.  Try changing the guardrails to modify the output of the LLM at the end.

After using this sample, let us know over on the Qualcomm Cloud AI Discord channel what you’ve created for your own scenarios. Be sure to sign up for free tokens and retrieve your API key from our partner Cirrascale.  Explore other topics in the Cloud AI blog series.

Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Qualcomm-branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

About the Author
Ray Stephenson
Developer Relations Lead, Cloud