Get hands-on with the Qualcomm Cloud AI 100 Ultra: introducing developer playground
Free and available now: the playground for Qualcomm Cloud AI 100 offers developers a straightforward way to experiment with AI models and explore the card’s performance. With hands-on access to APIs for Llama 3.1-8B, Llama 3.1-70B, and SDXL Turbo, developers can experience the power of the Qualcomm Cloud AI 100 Ultra.
What is Qualcomm Cloud AI 100 Ultra?
Purpose-built for generative AI and large language models, the Qualcomm Cloud AI 100 Ultra delivers top-tier performance per dollar, making it an ideal choice for large-scale AI inference deployments. Its specifications enable it to run demanding models like Llama 3.1-70B on a single card, and it supports multiple precision formats: FP16 and MxFP6 (a microscaling format).
The Qualcomm Cloud AI 100 Ultra card is designed specifically for high-throughput AI inference and has some unique advantages for AI developers looking to optimize their deployments.
Why is the Qualcomm Cloud AI 100 Ultra a great option for generative AI inference?
Here are some top benefits:
1. High-Throughput Inference without the Power Draw
The Qualcomm Cloud AI 100 Ultra offers performance of 870 TOPS, making it highly efficient for inference tasks. It's built to accelerate generative AI, computer vision and Natural Language Processing (NLP) applications, delivering high-speed inference while minimizing power consumption. The card provides the throughput needed to handle large-scale workloads effectively.
2. Efficiency: Performance at 150W TDP
With a 150W TDP, the Qualcomm Cloud AI 100 Ultra offers a lower power footprint compared to many data center accelerators. This allows you to maximize inference power within a single rack without exceeding power and cooling limits. It's also ideal for edge deployments, where power constraints are a critical consideration.
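To make the power argument concrete, here is a rough sketch of the arithmetic using the two figures quoted above (870 TOPS and 150W TDP). The server power budget and host overhead are hypothetical values chosen only for illustration, and peak TOPS and TDP are both ceilings, so real workloads will land below these numbers.

```python
# Figures quoted in the article.
PEAK_TOPS = 870
TDP_WATTS = 150


def tops_per_watt(tops: float, watts: float) -> float:
    """Peak throughput delivered per watt of TDP."""
    return tops / watts


def cards_in_budget(budget_watts: float, host_watts: float,
                    card_watts: float = TDP_WATTS) -> int:
    """Cards that fit in a power budget after reserving host power."""
    usable = budget_watts - host_watts
    return max(0, int(usable // card_watts))


# Hypothetical server: 2000 W budget, 800 W reserved for CPUs, fans, storage.
cards = cards_in_budget(budget_watts=2000, host_watts=800)
print(f"{tops_per_watt(PEAK_TOPS, TDP_WATTS):.1f} peak TOPS per watt")
print(f"{cards} cards, {cards * PEAK_TOPS} peak TOPS in aggregate")
```

Swapping in your own rack or chassis budget gives a quick upper bound on how many cards a given power envelope can host.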
3. Serious Memory for Big Models
This card comes with 128 GB of LPDDR4x memory and 576 MB of on-die SRAM. That’s a ton of on-card memory, paired with 548 GB/s of bandwidth, so you can handle larger models and datasets with ease. Higher memory capacity lets you increase batch sizes during inference, which raises overall throughput in real-world deployments. If you’re running models that demand a lot of memory, like those in video analytics or large language models, this card gives you the headroom to operate efficiently.
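As a back-of-the-envelope illustration of how memory capacity, precision format, and batch size interact, the sketch below estimates the weight footprint of a 70B-parameter model and how many sequences' worth of KV cache would fit in what is left of 128 GB. Everything except the card capacity is an assumption: the parameter count is nominal, the layer and head counts come from the publicly documented Llama 3.1 70B configuration, the MxFP6 cost assumes the OCP Microscaling layout (6-bit elements plus one 8-bit scale per 32 elements), and activations and runtime overhead are ignored.

```python
CARD_MEMORY_GB = 128          # on-card LPDDR4x capacity quoted above
PARAMS = 70e9                 # nominal parameter count (assumption)

# Bits stored per weight. The MXFP6 figure assumes 6-bit elements plus
# one 8-bit scale shared by each block of 32 elements.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "MXFP6": 6.0 + 8.0 / 32.0,
}

# Assumed Llama 3.1 70B attention shape, used for the KV-cache estimate.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES_PER_VALUE = 2        # FP16 cache entries (assumption)


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


def kv_cache_bytes_per_token() -> int:
    """Bytes of key + value cache one token occupies across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES_PER_VALUE


def max_sequences(free_gb: float, context_len: int) -> int:
    """How many full-length sequences fit in the leftover memory."""
    per_sequence = kv_cache_bytes_per_token() * context_len
    return int(free_gb * 1e9 // per_sequence)


for name, bits in BITS_PER_WEIGHT.items():
    weights = weight_gb(PARAMS, bits)
    free = CARD_MEMORY_GB - weights
    if free <= 0:
        print(f"{name}: {weights:.1f} GB of weights exceeds {CARD_MEMORY_GB} GB")
    else:
        print(f"{name}: {weights:.1f} GB of weights, "
              f"room for ~{max_sequences(free, 4096)} sequences of 4096 tokens")
```

Under these assumptions, FP16 weights alone (about 140 GB) would not fit in 128 GB, while the 6-bit format needs roughly 55 GB and leaves room for a few dozen 4096-token sequences. That is the kind of headroom that lets batch size grow.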
4. Developer-Friendly Software Stack
The Qualcomm Cloud AI 100 Ultra is supported by the Qualcomm Cloud AI Stack, which can ingest models from popular AI frameworks such as TensorFlow, PyTorch, and ONNX. At the core of this stack is the AI100 Runtime, designed to maximize the hardware's performance while ensuring compatibility with a wide range of AI models. This runtime works alongside ONNX Runtime, vLLM, and Triton Inference Server.
For developers working with transformer models, Qualcomm Technologies provides the Efficient Transformers library, available on GitHub. This library is designed to make it easy for developers to port pretrained models from the Hugging Face (HF) hub into optimized, inference-ready formats that run efficiently on Qualcomm Cloud AI 100 accelerators. It includes re-implemented blocks for large language models (LLMs), fine-tuned to achieve high performance on Qualcomm Technologies' hardware. Models can be directly converted from their original pretrained state to an optimized deployment-ready form.
In addition to model optimization, the software stack offers a suite of developer tools, such as profilers, debuggers, and emulation environments, to streamline development and testing. The stack is also compatible with multiple operating systems, including Red Hat, Ubuntu, and CentOS, enabling easy integration into various data center setups.
The inference services are built on top of the robust underlying Cloud AI stack and provide easy-to-use, OpenAI-compatible APIs for chat clients, RAG, embeddings, and image generation. They hide the complexities of LLM serving from the user and provide a swift experience from development through deployment. Starting a conversation with an LLM takes only a few lines of code, as shown in the snippet below.
from imagine import ChatMessage, ImagineClient

# Create a client with your API key
client = ImagineClient(api_key="my-api_key")

# Send a single user message to the Llama-3.1-8B model
chat_response = client.chat(
    messages=[ChatMessage(role="user", content="What is the best Spanish cheese?")],
    model="Llama-3.1-8B",
)

# Print the text of the first response
print(chat_response.first_content)
5. Perfect for Cloud and Edge Setups
The Qualcomm Cloud AI 100 Ultra isn’t limited to data centers; its power efficiency and compact form factor make it an excellent choice for edge deployments as well. With a PCIe Gen 4 x16 interface and a full-height, 3/4-length design, it can be integrated into a range of environments, whether scaling in the cloud or optimizing for on-site edge processing.
6. More Inference, Lower Costs
Deploying AI in the cloud can quickly become expensive. The Qualcomm Cloud AI 100 Ultra is designed to offer an affordable solution for AI inference, thanks to its power efficiency. By maximizing inferences per watt, it helps reduce operational costs, allowing developers to scale their models without exceeding budget constraints. If you’re a developer relying on cloud resources for running models, this card helps you scale without breaking the bank.
The Qualcomm Cloud AI 100 Ultra is currently available in the cloud through Cirrascale, TensorOpera, and Core42's global data centers.
Explore Playground for Qualcomm Cloud AI
Ready to explore the capabilities of the Qualcomm Cloud AI 100 Ultra? Try out the Llama 3.1-8B, Llama 3.1-70B, and SDXL Turbo APIs, powered by the Qualcomm Cloud AI 100 Ultra, through our free playground for Qualcomm Cloud AI. Experience firsthand how it can optimize your AI deployments.
Access playground for Qualcomm Cloud AI now via this link: https://cloudai.cirrascale.com/
Connect with fellow developers, get the latest news and prompt technical support by joining our Developer Discord.

