AI inference at scale with the AI250 rack-scale platform

Qualcomm Dragonfly™

AI250

Rack-scale AI inference with breakthrough memory bandwidth, engineered for real-time agentic AI and low TCO.

Qualcomm Dragonfly™ AI250, our second-generation rack-scale AI inference platform, introduces the groundbreaking Qualcomm® High Bandwidth Compute (HBC) Gen 1, enabling 133 TB/s of effective memory bandwidth per card (18x more than AI200). Combining massive memory bandwidth and capacity enables fast, low-latency inference on SOTA models of up to 10T parameters and context lengths of up to 1M tokens, all within an ORv3-compliant air- and direct liquid-cooled rack.

Built for hyperscale agentic AI, it enables disaggregated inferencing at superior tokens per watt and tokens per dollar.

Real-Time Token Generation

HBC delivers an industry-leading 133 TB/s of effective memory bandwidth per card¹, targeting the memory-bound decode phase of AI inference to enable fast token generation.

Low Latency per Token

Massive effective memory bandwidth keeps decode pipelines fed, reducing latency per token for more responsive real-time agentic AI experiences.
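
To make the link between memory bandwidth and latency per token concrete, here is a minimal first-order sketch. It assumes decode is purely memory-bound and that every model weight is read once per generated token; it ignores KV-cache traffic, compute time, batching, and interconnect overhead. The 70B-parameter model with 8-bit weights is a hypothetical example, not a published benchmark, and the function name is illustrative only.

```python
def decode_time_per_token_s(param_count: float,
                            bytes_per_param: float,
                            bandwidth_bytes_per_s: float) -> float:
    """Lower bound on seconds per token if all weights are read once per token."""
    return param_count * bytes_per_param / bandwidth_bytes_per_s


# 133 TB/s effective memory bandwidth per card, as quoted on this page.
CARD_BANDWIDTH = 133e12

# Hypothetical model (assumption): 70B parameters stored at 1 byte each.
t = decode_time_per_token_s(70e9, 1.0, CARD_BANDWIDTH)
print(f"{t * 1e3:.3f} ms per token, at most {1 / t:,.0f} tokens/s per stream")
```

Under these assumptions the bound is roughly half a millisecond per token for a single stream. Real throughput depends on factors this estimate leaves out.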

Large Context Window

With 768 GB of memory per card, AI250 is designed to minimize context eviction for agentic workloads with long reasoning chains.
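
As a rough illustration of why per-card capacity matters for long contexts, the sketch below estimates key-value (KV) cache size using the standard transformer formula: two vectors (key and value) per layer per token. The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical assumption chosen for illustration, and the 768 GB figure is treated as decimal gigabytes.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache size: one key and one value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


CARD_MEMORY_BYTES = 768e9  # 768 GB per card, from this page (decimal GB assumed)

# Hypothetical model shape (assumption), with a 1M-token context.
cache = kv_cache_bytes(tokens=1_000_000, layers=80, kv_heads=8, head_dim=128)
print(f"KV cache: {cache / 1e9:.1f} GB, "
      f"{cache / CARD_MEMORY_BYTES:.0%} of one card's memory")
```

Model weights and activations also need memory, so this shows only how much of a card a single long context could occupy under the assumed shape.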

Total Cost Advantage at Scale

AI250 is engineered to deliver superior total cost of ownership (TCO) versus the competition at equal tokens per second per user, making real-time inference economically viable at scale.

Superior Tokens per Watt

4x to 8x better performance per watt than contemporary GPU-based architectures, measured as memory bandwidth per watt per card.²

Deploy AI Faster

Qualcomm AI Inference Suite streamlines deployment across bare metal, cloud VMs, and inference-as-a-service, helping teams move from model to production faster.

¹Compared to competing published product specifications normalized at card- and rack-level

²Qualcomm estimates compared to contemporary GPU-based architectures on memory bandwidth per watt per card

Features

Industry-leading effective memory bandwidth¹ to deliver compelling cost per token for real-time inferencing on SOTA models of up to 10T parameters and context lengths of up to 1M tokens
Designed for memory-bound and real-time inference at hyperscale economics: LLM decode, reasoning, agentic AI, and multimodal generation
HBC architecture for industry-leading performance per watt²
43 TB memory capacity and 7.4 PB/s effective bandwidth with HBC per rack
Over 6 TB of HBC memory per server, capable of supporting models with over 10T parameters and reducing networking dependency
PCIe Gen6 scale-up; Ethernet-with-RoCE scale-out
Qualcomm AI Inference Suite for bare-metal, VM, or IaaS deployment
Rack-scale solution with liquid cooling, storage, network switches, and NICs
Air and direct liquid cooling
OCP ORv3-compliant rack with cableless backplane
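
The per-server capacity claim in the list above can be sanity-checked with simple arithmetic. The sketch below asks how many bits per parameter a 10T-parameter model could use and still fit its weights in 6 TB. It is only a capacity estimate under stated assumptions: the page does not say which weight precision is used, "over 6 TB" is treated as exactly 6 TB (decimal), and KV cache and runtime overhead are ignored.

```python
SERVER_MEMORY_BYTES = 6e12  # "over 6 TB" per server, treated as a floor
PARAMS = 10e12              # 10T parameters


def weight_bytes(params: float, bits_per_param: float) -> float:
    """Bytes needed to store the weights at a given precision."""
    return params * bits_per_param / 8


# Largest average precision whose weights still fit in one server.
max_bits = SERVER_MEMORY_BYTES * 8 / PARAMS
print(f"Budget: {max_bits:.1f} bits per parameter")

for bits in (16, 8, 4):
    size = weight_bytes(PARAMS, bits)
    verdict = "fits" if size <= SERVER_MEMORY_BYTES else "does not fit"
    print(f"{bits:>2}-bit weights: {size / 1e12:.0f} TB, {verdict}")
```

Under these assumptions, holding a 10T-parameter model within a single server implies an average of under 5 bits per parameter for the weights.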

Specifications

Rack
Form Factor: Single Wide, Open Rack v3 (ORv3) compliant
Number of Cards:

Memory
Capacity: 43 TB
Effective Bandwidth: 7.455 PB/s with HBC Gen 1 (18x compared to AI200)

Scale-Up
Interface Type: PCIe 6.0

Scale-Out
Interface Type: Ethernet w/RoCE

Thermal Management
Cooling: Direct Liquid Cooling (DLC), Air Cooling
Thermal Design Power: 140 kW
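
The rack-level figures above can be cross-checked against the per-card figures quoted earlier on this page (768 GB and 133 TB/s per card). The sketch below divides one by the other. The resulting card count is an inference from rounded marketing figures, not a published specification, and the bandwidth-per-kilowatt value uses the full rack thermal design power, which presumably also covers switches, storage, and cooling.

```python
RACK_MEMORY_GB = 43_000  # 43 TB per rack (decimal units assumed)
CARD_MEMORY_GB = 768     # per card
RACK_BW_TBPS = 7_455     # 7.455 PB/s per rack
CARD_BW_TBPS = 133       # per card
RACK_TDP_KW = 140        # rack thermal design power

# Two independent estimates of the implied number of cards per rack.
cards_by_memory = RACK_MEMORY_GB / CARD_MEMORY_GB
cards_by_bandwidth = RACK_BW_TBPS / CARD_BW_TBPS

# Effective memory bandwidth per kilowatt at rack level.
bw_per_kw = RACK_BW_TBPS / RACK_TDP_KW

print(f"Implied cards (memory):    {cards_by_memory:.2f}")
print(f"Implied cards (bandwidth): {cards_by_bandwidth:.2f}")
print(f"Bandwidth per kW:          {bw_per_kw:.2f} TB/s")
```

Both ratios land close to the same whole number, which suggests the card-level and rack-level figures are mutually consistent.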

QUALCOMM DRAGONFLY AI INFRASTRUCTURE MANAGEMENT SUITE

Orchestrate and scale efficiency

The suite provides provisioning, monitoring, orchestration and fault handling across rack‑scale deployments. Together, hardware, connectivity and software form the foundation of a cohesive data center platform approach — one designed to scale with customers as AI workloads evolve.