Developer Blog

Announcement: Compass 2.3 powered by Qualcomm delivers cost-efficient high-performance Generative AI inference in UAE

Written by

Ebrahim Popat

A.K. Roy

Oct 8, 2024

Over the past year, Generative AI has unleashed new and exciting use-cases and applications, such as conversational chatbots, code generation, writing assistance, image and video creation and document processing co-pilots. Businesses across the world, both large and small, are using GenAI to deliver new and better products and experiences to their customers.

Businesses are also evolving from simpler chatbots and co-pilots to AI Agents, which understand and complete complex tasks encapsulated in a user’s prompt without any intervention. An AI Agent decomposes the user prompt into a series of sub-tasks, plans them, connects to and uses the appropriate business databases and datastores as well as plug-ins such as web search or email, self-checks the result for errors and hallucinations on completion of the task, and retries if required to address any errors identified.
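
The agent loop described above can be sketched roughly as follows. This is only an illustration of the decompose, execute, self-check and retry pattern, not how Compass or any particular agent framework implements it. The callables `decompose`, `run_subtask` and `check` are hypothetical placeholders for model calls and tool integrations.

```python
from typing import Callable, List


def run_agent(
    prompt: str,
    decompose: Callable[[str], List[str]],  # hypothetical: splits the prompt into ordered sub-tasks
    run_subtask: Callable[[str], str],      # hypothetical: executes one sub-task (model or tool call)
    check: Callable[[str, str], bool],      # hypothetical: self-check of a sub-task result
    max_retries: int = 2,
) -> List[str]:
    """Plan sub-tasks, execute each one, self-check it, and retry on failure."""
    results = []
    for subtask in decompose(prompt):
        for _attempt in range(max_retries + 1):
            result = run_subtask(subtask)
            if check(subtask, result):
                break  # result passed the self-check
        else:
            # every attempt failed the self-check
            raise RuntimeError(f"sub-task failed after retries: {subtask}")
        results.append(result)
    return results


# Toy stand-ins so the sketch runs without any model or tool access
steps = run_agent(
    "find product; draft email",
    decompose=lambda p: [s.strip() for s in p.split(";")],
    run_subtask=lambda s: f"done: {s}",
    check=lambda s, r: r.startswith("done"),
)
print(steps)
```

In a real agent each placeholder would typically be backed by one or more model calls, which is why a model with reliable tool-calling matters for this pattern.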

Core42, a G42 Company and leading provider of sovereign cloud, AI infrastructure and services in the UAE, and Qualcomm Technologies, Inc., a global leader in high-performance, low-power AI solutions, launched Compass 2.0, a next-generation enterprise GenAI platform strengthened by Qualcomm Cloud AI accelerators, in July 2024.

Compass 2.3 is available now and hosts a curated set of open-source Generative AI LLMs and image-generation models in Core42’s datacenters in the UAE. The list of open-source models includes LLaMA 3.1, Stable Diffusion SDXL and embedding models. Customers can build a wide range of GenAI applications and agents using these models, at competitive pricing.

Compass powered by Qualcomm Technologies feature overview

Compass 2.3 powered by Qualcomm Technologies provides the following curated list of models in UAE, via its API:

LLaMA 3.1 and 3.0, 70B and 8B instruct models: LLaMA 3 models are Meta’s most advanced and capable models to date. The LLaMA 3.1 70B model demonstrates state-of-the-art performance on a wide range of industry benchmarks and is recommended for use by agents because of its conversation and tool-calling capabilities, while being cost-efficient.
Stable-diffusion XL Turbo v1.0 (SDXL-Turbo v1.0): SDXL-Turbo is a fast generative text-to-image model that can synthesize photorealistic images from a text prompt.
Mixtral 8x7B and Mistral 7B: Mixtral 8x7B is a cost-efficient mixture of experts (MoE) model.
BGE-large: BGE-large is a popular embedding model.

Customers and users can build a broad range of GenAI use-cases, applications and agents, from a simple conversational chatbot with RAG support to a complete AI agent that creates personalized marketing emails composed of both text and images for retail end-customers and sends them to the intended recipients.
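
To make "RAG support" concrete, here is a minimal sketch of the retrieval step: rank stored text chunks by cosine similarity to the query embedding and put the best matches into the prompt. The three-dimensional vectors and the sample texts are made-up stand-ins for real embeddings from a model such as BGE-large, and the prompt template is an assumption, not a Compass convention.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          docs: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the k chunk texts most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_rag_prompt(question: str, chunks: List[str]) -> str:
    """Prepend the retrieved chunks to the user question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy corpus: (text, embedding) pairs with hypothetical 3-d embeddings
docs = [
    ("Store hours are 9am to 9pm.", [0.9, 0.1, 0.0]),
    ("Returns are accepted within 30 days.", [0.1, 0.9, 0.0]),
    ("Delivery takes two days.", [0.0, 0.2, 0.9]),
]
query_vec = [0.2, 0.95, 0.05]  # hypothetical embedding of the question below
chunks = top_k(query_vec, docs, k=1)
print(build_rag_prompt("What is the returns policy?", chunks))
```

The resulting prompt string would then be sent as the chat content, so the model answers from the retrieved business data rather than from its training data alone.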

All the above models run in Core42’s UAE datacenters, on the purpose-built Qualcomm Cloud AI 100 Ultra.

The Qualcomm Cloud AI 100 inference accelerator integrates a scalable sixth-generation multi-core programmable architecture and supports cost-efficient high-performance inference on a broad range of GenAI, LLM, ImageGen, NLP, CV and automotive models, including LLaMA, Mixtral, Stable-diffusion, JAIS, CodeGen and others.

The accelerator also supports multi-card capability, which enables up to four accelerators to function as one large accelerator, with an aggregate performance of over 2.5 INT8 PetaOps, 888 FP16 TFlops, 2.2 TB/s of memory BW, 2.3 GB of on-die SRAM capacity and 512 GB of on-card memory capacity.

The accelerator form-factor also enables the dense integration of sixteen accelerators into a single 2U inference server, with an aggregate capacity of over 2 TB of memory, capable of supporting the largest of open-source models.
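
A back-of-the-envelope check helps put these capacities in context. The sketch below estimates the memory needed for model weights alone, assuming 2 bytes per parameter for FP16 and nominal parameter counts of 8 and 70 billion. It ignores KV-cache, activations and runtime overhead, so real requirements are higher. The capacity figures are the ones quoted above.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


FOUR_CARD_GB = 512               # on-card memory of a four-accelerator group (from the text)
PER_CARD_GB = FOUR_CARD_GB / 4   # 128 GB per card, derived from the four-card figure
SERVER_GB = 16 * PER_CARD_GB     # 2048 GB, consistent with "over 2 TB" for sixteen cards

for name, params in [("LLaMA 3.1 8B", 8), ("LLaMA 3.1 70B", 70)]:
    fp16_gb = weight_memory_gb(params, 2)  # assumption: 2 bytes per FP16 parameter
    print(f"{name}: ~{fp16_gb:.0f} GB of FP16 weights, "
          f"fits on one card: {fp16_gb <= PER_CARD_GB}, "
          f"fits on a four-card group: {fp16_gb <= FOUR_CARD_GB}")
```

Under these assumptions a 70B model's FP16 weights exceed a single card but fit within a four-card group, which is where the multi-card capability becomes relevant.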

The Qualcomm Cloud AI 100 software stack and compiler support all prevalent and familiar GenAI and LLM high-performance algorithms and software technologies including Tensor-parallelism, Pipeline-parallelism, attention mechanisms, vLLM, continuous batching, flash attention, speculative decoding, MXFP6 format, AWQ quantization and KV-Cache enhancements.

On an LLM such as LLaMA 3.0 8B, the Qualcomm Cloud AI 100 is competitively priced on tokens/TCO$ on real-time and offline use-cases, compared to alternative inference solutions.
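
The article does not define tokens/TCO$. One plausible reading is the number of tokens generated over a deployment's lifetime divided by its total cost of ownership. The sketch below uses that reading with a deliberately simple TCO model (hardware plus energy only). Both the formula and every number in it are illustrative assumptions, not measurements or vendor data.

```python
def tokens_per_tco_dollar(
    tokens_per_second: float,
    hardware_cost_usd: float,
    power_watts: float,
    usd_per_kwh: float,
    years: float,
    utilization: float = 1.0,
) -> float:
    """Lifetime tokens divided by a simplified TCO (hardware + energy)."""
    hours = years * 365 * 24
    total_tokens = tokens_per_second * 3600 * hours * utilization
    # Simplification: the device is assumed to draw full power for all hours
    energy_cost_usd = (power_watts / 1000) * hours * usd_per_kwh
    tco_usd = hardware_cost_usd + energy_cost_usd
    return total_tokens / tco_usd


# Two hypothetical accelerators with placeholder numbers, for comparison only
option_a = tokens_per_tco_dollar(1000, 10_000, 150, 0.10, years=3)
option_b = tokens_per_tco_dollar(1500, 30_000, 700, 0.10, years=3)
print(f"A: {option_a:,.0f} tokens per TCO dollar")
print(f"B: {option_b:,.0f} tokens per TCO dollar")
```

The point of the metric is that raw throughput alone does not decide the comparison: a slower device can still win if its purchase price and power draw are low enough.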

More information on the Qualcomm Cloud AI product portfolio can be obtained at the Cloud AI 100 product page.

Inference calls from customer applications using the Compass API are routed by Compass to the Qualcomm Cloud AI 100 Inference-service platform that manages the accelerator capacity. The Inference-service platform is a Kubernetes-based platform that supports multiple Compass calls concurrently via a high-performance API server, with built-in queue-based auto-scaling supported at a per-model level. The Inference-service platform supports both request-response and streaming modes and has built-in observability and telemetry. The platform can also isolate tenants/access-keys to a specific set of Qualcomm Cloud AI 100 cards.
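
As a rough illustration of queue-based auto-scaling at a per-model level, the sketch below derives a replica count for each model from its request queue depth. The scaling rule, the thresholds and the model names are hypothetical. The article does not describe the actual policy used by the Inference-service platform.

```python
import math
from typing import Dict


def desired_replicas(queue_depth: int,
                     target_queue_per_replica: int,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Scale so each replica handles about target_queue_per_replica queued requests."""
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_queue_per_replica)
    # Clamp to the allowed range so a model never scales to zero or without bound
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical per-model queue depths; each model is scaled independently
queues: Dict[str, int] = {"chat-model": 45, "embedding-model": 3}
plan = {model: desired_replicas(depth, target_queue_per_replica=10)
        for model, depth in queues.items()}
print(plan)
```

Scaling each model on its own queue means a burst of chat traffic does not take capacity planning decisions away from a lightly used embedding model.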

Compass powered by Qualcomm Technologies supports comprehensive data privacy and security features. User data, including prompts (inputs) and completions (outputs), is not stored in the platform and is not available to any other user of the platform or to any other service. The Inference-service platform does not interfere with or intercept any completion (output) quality filtering, abuse filtering or model statelessness that the model providers have implemented and delivered in the models. The Qualcomm Cloud AI 100 accelerator also supports a model-confidentiality feature that enables model providers to encrypt their models, to protect against unauthorized use. The Qualcomm Cloud AI 100 accelerator hardware also supports all hyperscaler-cloud reliability, availability and security features such as secure boot, secure firmware, memory reliability and attestation.

Compass API calls for models hosted by Qualcomm Technologies’ accelerators are routed to and serviced by a high-performance inference-service platform that manages the accelerator's capacity.

Compass API calls for models block diagram

Additional open-source models, including LLaMA 3.2 models and others, will be added in subsequent Compass releases to meet the needs of UAE businesses and customers.

Getting started on building applications with Compass powered by Qualcomm Technologies

Customers can get started with building applications and agents using LLaMA or the other models available in Compass powered by Qualcomm Technologies by contacting Core42 Compass Sales at compass.support@core42.ai and obtaining an appropriate Compass access key, with access to the models of interest and appropriate token rate limits.

In the example below, we build a simple chat application that uses the LLaMA3.1-70B model available via the Compass API, in three easy steps. Within the steps, we build a simple function that creates and launches a call to the LLaMA3.1-70B model, and post-processes the response.

The chat application has a user-interface shown below, where a chat user can type in prompts in the text-box at the bottom of the application and then click on the “Send” button, to launch the Compass API call to the model. In the example chat displayed below, the user prompt was “When did GiTEX start?”. The response from the model is subsequently displayed in the application window, along with the user query.

Step 1. Declare required Compass API call arguments

API call arguments that need to be provided are:

(a) API_key

(b) Type of inference: chat or embeddings

(c) Model name: the name of the model to call

(d) Content: the user prompt

(e) Stream mode: whether streaming mode is required

import argparse
import requests

# Declare the positional command-line arguments for the Compass API call
parser = argparse.ArgumentParser()
parser.add_argument('api_key', type=str, help='Provide Compass API Key')
parser.add_argument('type_of_inference', type=str, help='chat|embeddings')
parser.add_argument('model_name', type=str, help='provide a model name')
parser.add_argument('content', type=str, help='ask a question to chat or create an embeddings')
parser.add_argument('stream', type=str, help='true|false: choose whether you need a streaming response or not in chat')
args = parser.parse_args()
# Print the parsed arguments, leaving out the API key so it does not end up in logs
print(args.type_of_inference, args.model_name, args.content, args.stream)
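
Because all five arguments are positional, they must be supplied in the order in which they are declared. The sketch below rebuilds an equivalent parser and feeds it a sample argument list to show how the values map onto the fields. The script name `compass_chat.py`, the key `MY_KEY` and the model name `my-model` are hypothetical placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same five positional arguments as above."""
    parser = argparse.ArgumentParser()
    for name in ("api_key", "type_of_inference", "model_name", "content", "stream"):
        parser.add_argument(name, type=str)
    return parser


# Equivalent to running (placeholder values):
#   python compass_chat.py MY_KEY chat my-model "When did GITEX start?" false
sample_argv = ["MY_KEY", "chat", "my-model", "When did GITEX start?", "false"]
parsed = build_parser().parse_args(sample_argv)
print(parsed.type_of_inference, parsed.model_name, parsed.stream)
```

Note that the prompt has to be quoted on the command line so that the shell passes it as a single argument.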

Step 2. Define the function which will parse the Compass API call arguments, create and launch the Compass API call

Create and launch the API call to the appropriate end-point using the API call arguments provided by the user.

def main(args):

    # Retrieve the call arguments parsed in Step 1

    api_key = args.api_key
    inference_type = args.type_of_inference
    model_name = args.model_name
    content = args.content
    # The stream argument arrives as a string; only 'false' (in any case) disables streaming
    stream = args.stream.lower() != 'false'

    # Request headers shared by the chat and embeddings end-points
    headers = {
        'Content-type': 'application/json',
        'Accept': 'application/json',
        'Cache-Control': 'no-cache',
        'api-key': api_key
    }

    # If the inference type is chat, create and launch the API call to the model and process the response

    if inference_type == "chat":
        payload = {
            "model": model_name,
            "messages": [{
                "role": "user",
                "content": content
            }],
            "stream": stream
        }

        base_url = "https://api.core42.ai/v1/chat/completions"
        # stream=stream lets requests yield lines as they arrive instead of buffering the whole body
        response = requests.post(base_url, json=payload, headers=headers, stream=stream)
        if stream:
            streamed_response = []
            for chunk in response.iter_lines():
                if chunk:  # skip blank keep-alive lines
                    streamed_response.append(chunk.decode('utf-8'))
            print(streamed_response)
        else:
            print(response.text)
    elif inference_type == "embeddings":

        # If the inference type is embeddings, create and launch the API call to the embedding model and process the response

        payload = {
            "input": content,
            "model": model_name
        }
        base_url = "https://api.core42.ai/v1/embeddings"
        response = requests.post(base_url, json=payload, headers=headers)
        print(response.text)
    else:
        print("Required type is not supported")

if __name__ == "__main__":
    main(args)
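
In streaming mode the function above prints the raw response lines. The end-point path `/v1/chat/completions` suggests an OpenAI-style interface, but the article does not confirm the wire format. Assuming each line has the form `data: {json}` with the text under `choices[].delta.content`, and that the stream ends with `data: [DONE]`, the reply could be reassembled as sketched below. The sample lines are made up for illustration.

```python
import json
from typing import Iterable


def assemble_stream(lines: Iterable[str]) -> str:
    """Join the text fragments of an assumed OpenAI-style event stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore blank lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            parts.append(delta.get("content") or "")
    return "".join(parts)


# Made-up sample lines in the assumed format
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "Hello, "}}]}',
    'data: {"choices": [{"delta": {"content": "world"}}]}',
    'data: [DONE]',
]
print(assemble_stream(sample_lines))
```

A chat user interface would normally append each fragment to the display as it arrives rather than waiting for the full reply, which is the main reason to use streaming mode.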

Step 3. Use the above function in the chat application

Call the above function repeatedly in the chat application, with the appropriate arguments, as mentioned below:

(a) API_key: the Compass API_key obtained from the Compass team, with access to the LLaMA-3.1 70B model

(b) Type of inference: chat

(c) Model name: the LLaMA-3.1 70B model

(d) Content: retrieve the current user prompt and concatenate the responses from the previous calls

(e) Stream mode: stream
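
One way to build the content argument for each call is sketched below: keep a list of earlier turns and join them with the new user prompt into a single string. The `User:` and `Assistant:` labels and the `max_turns` limit are assumptions made for illustration. The article does not prescribe a format for the concatenated history.

```python
from typing import List, Tuple


def build_content(history: List[Tuple[str, str]],
                  user_prompt: str,
                  max_turns: int = 5) -> str:
    """Concatenate recent (prompt, response) turns with the current user prompt."""
    lines = []
    # Keep only the most recent turns so the prompt does not grow without bound
    for prompt, reply in history[-max_turns:]:
        lines.append(f"User: {prompt}")
        lines.append(f"Assistant: {reply}")
    lines.append(f"User: {user_prompt}")
    return "\n".join(lines)


# Hypothetical chat state: one earlier exchange, then a new question
history = [("Hi", "Hello! How can I help?")]
content = build_content(history, "When did GITEX start?")
print(content)
```

After each call, the application would append the new prompt and the model's response to `history` so that the next call carries the conversation so far.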

That’s it. With just a few lines of code, we have made a call to the LLaMA-3.1 70B model, hosted by the Compass powered by Qualcomm service, and have used the function in a chat application.

Next steps

Learn more about Compass and the Compass 2.0 launch announcement.

Get started with Qualcomm Cloud developer tutorials and documentation.

To learn more about the Qualcomm Cloud product performance, supported list of models and other product announcements, see related articles on Developer Blog.

Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

About the Authors

Ebrahim Popat, Product Owner

A.K. Roy, Director of Product Management