Real-Time Text-to-Image Synthesis with Adversarial Diffusion Distillation (ADD) Models on Qualcomm Cloud AI 100
Co-written with Bonan Zhang.
Have you heard about SDXL Turbo? It is an adversarial-diffusion-distilled version of the SDXL text-to-image model that generates images from text prompts in real time. On Qualcomm Cloud AI 100 Ultra (TDP of 150W), we can generate 4 images for 4 different prompts every ~250ms. Read on to learn more about the method, and try it yourself in the Playground for Qualcomm Cloud AI.
Jan 20, 2025 | 0:39

Synthesizing an image from a textual description is one of the most widely used applications of deep generative AI models. The mathematical frameworks behind these models are varied, and many of them are about a decade old: Variational Autoencoders (2013), Generative Adversarial Networks (GANs) (2014), and Normalizing Flows (2015). More recently, denoising diffusion probabilistic models (2015, 2020) and their generalizations such as flow matching (2022) have emerged as powerful alternatives that excel at creating high-quality images.
Horns of Trilemma
The phrase “Generative Learning Trilemma” was coined by Xiao et al. in 2022 to describe three requirements of generative models. None of the mathematical frameworks mentioned above satisfies all three.
The requirements are:
- Ability to generate high-quality samples
- Ability to produce diverse outputs representing the whole distribution (aka mode coverage)
- Ability to generate the samples fast
| Requirement | VAE/Normalizing Flows | GANs | Diffusion Models |
|---|---|---|---|
| High-Quality Samples | No | Yes | Yes |
| Diverse Outputs | Yes | No | Yes |
| Fast Sampling | Yes | Yes | No |
As the above table shows, GANs can generate high-quality samples fast because sampling involves just one evaluation of the generator neural network, that is, 1 NFE (network function evaluation). Diffusion models can also generate high-quality images and even outperform GANs (Dhariwal, 2021), but they are much slower because their sampling is iterative and requires hundreds of NFEs.
Diffusion models generate outputs by starting from a random input and applying a sequence of denoising steps that remove randomness by gradually introducing structure into the output. For conditional generation, such as generating an image that corresponds to a given text prompt, the classifier-free guidance (CFG) technique is commonly used, and each denoising step then involves 2 invocations of the denoising neural network. CFG offers a trade-off between diversity and sample quality through a single parameter called the guidance strength. CFG can also be used to steer the model away from certain concepts in the generated image, which are specified via a ‘negative prompt’. So, with CFG, diffusion models require a number of NFEs equal to 2 times the number of denoising steps.
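The two network invocations per step can be made concrete with a small sketch. It assumes the common form of CFG in which the guided estimate moves from a baseline prediction (unconditional or negative-prompt) toward the text-conditioned one, scaled by the guidance strength; other scale conventions exist. `CountingDenoiser` is a hypothetical toy stand-in for the denoising network, not a real model.

```python
from typing import Callable, List, Optional

Vector = List[float]


def cfg_noise_estimate(
    denoiser: Callable[[Vector, int, Optional[str]], Vector],
    x_t: Vector,
    t: int,
    prompt: str,
    guidance_strength: float,
    negative_prompt: Optional[str] = None,
) -> Vector:
    """Combine two denoiser calls into one guided noise estimate."""
    # First NFE: conditioned on the text prompt.
    eps_cond = denoiser(x_t, t, prompt)
    # Second NFE: unconditional (None) or conditioned on the negative prompt.
    eps_base = denoiser(x_t, t, negative_prompt)
    # Move away from the baseline estimate, toward the conditional one.
    return [b + guidance_strength * (c - b) for c, b in zip(eps_cond, eps_base)]


class CountingDenoiser:
    """Toy stand-in for the denoising network that counts its evaluations."""

    def __init__(self) -> None:
        self.nfe = 0

    def __call__(self, x_t: Vector, t: int, prompt: Optional[str]) -> Vector:
        self.nfe += 1
        shift = 0.0 if prompt is None else 1.0  # hypothetical conditioning effect
        return [v + shift for v in x_t]


denoiser = CountingDenoiser()
x = [0.0, 0.5]
num_steps = 10
for step in range(num_steps):
    eps = cfg_noise_estimate(denoiser, x, step, "an origami pig", guidance_strength=7.5)

print(denoiser.nfe)  # 2 * num_steps
```

With a guidance strength of 1 in this convention the result equals the conditional estimate, and larger values push further from the baseline.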
Towards Tackling the Trilemma
To tackle the trilemma, an active area of research aims to speed up the sampling process of diffusion models, which is their main deficiency.
- Guidance distillation is a technique that distills a denoising network requiring 2 NFEs per denoising step for CFG into another denoising network that requires only 1 NFE per step.
- Sampling in diffusion models is equivalent to solving a (stochastic/ordinary) differential equation. Instead of solving this differential equation with black-box solvers, crafting solvers that exploit the structure of diffusion differential equations makes it possible to sample high-quality images in 15 denoising steps (DPM-Solver++). You can try this solver on AI 100 on Stable Diffusion Model here. The main selling point of this approach is that it is applied at inference time, so there is no need to retrain the model.
- Another approach reduces the computational requirements of the denoising network by discovering a simpler network using network-architecture search (NAS) and fine-tuning its weights. The model Deci Diffusion was constructed via this approach, and it also reduces the required number of denoising steps. It is available for use on Qualcomm Cloud AI accelerators, here.
- Yet another approach, the focus of this post, is called Adversarial Diffusion Distillation (ADD). It is a new way of taking a powerful foundational diffusion model like SDXL and making it generate appealing, high-quality images conditioned on text with few NFEs. This results in very fast image generation.
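To compare the approaches in the list above on a single axis, the NFE budgets can be tallied with simple arithmetic. The 100-step baseline is an assumed illustrative value (the post only says "hundreds of NFEs"), and pairing DPM-Solver++ with CFG is likewise an assumption for the sake of the comparison.

```python
def total_nfe(denoising_steps: int, uses_cfg: bool) -> int:
    """NFEs for one image: CFG doubles the network calls per denoising step."""
    calls_per_step = 2 if uses_cfg else 1
    return denoising_steps * calls_per_step


budgets = {
    # Step counts marked "assumed" are illustrative, not measurements.
    "baseline sampler with CFG (100 steps, assumed)": total_nfe(100, uses_cfg=True),
    "guidance-distilled (100 steps, assumed)": total_nfe(100, uses_cfg=False),
    "DPM-Solver++ with CFG (15 steps)": total_nfe(15, uses_cfg=True),
    "ADD, 4 steps, no CFG": total_nfe(4, uses_cfg=False),
    "ADD, 1 step, no CFG": total_nfe(1, uses_cfg=False),
}

for name, nfe in budgets.items():
    print(f"{name}: {nfe} NFEs")
```

Architecture search is not captured by this count, since it lowers the cost of each evaluation rather than the number of evaluations.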
Adversarial Diffusion Distillation
ADD is an innovative training technique presented by Sauer et al. It was used to create the SDXL Turbo model by distilling SDXL, a large foundational diffusion model. SDXL Turbo generates images at 512x512 resolution from text prompts using 1 to 4 denoising steps without classifier-free guidance. The distinguishing feature of ADD is the loss function used in training. It consists of two components:
- Score Distillation Sampling (SDS) Loss
- Adversarial (AD) Loss
The SDS Loss was introduced in the pioneering work of Poole et al. for text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models. It enables the knowledge embodied in pretrained foundational 2D models to be transferred to downstream tasks. By itself, however, it is known to produce blurry, oversaturated images. By incorporating the AD Loss, the ADD approach is able to generate sharp, high-fidelity images in just a few denoising steps without introducing distracting artifacts. ADD has also been shown to outperform other models such as LCM and single-step GANs.
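One way to write an objective with these two components is as a weighted sum. This is only a sketch: the weight $\lambda$ and the subscript names are notation introduced here for illustration, not taken from this post.

```latex
\mathcal{L}_{\mathrm{ADD}} \;=\; \mathcal{L}_{\mathrm{AD}} \;+\; \lambda \, \mathcal{L}_{\mathrm{SDS}}, \qquad \lambda \ge 0
```

In this form, $\lambda = 0$ leaves a purely adversarial objective, while larger values lean harder on the knowledge distilled from the pretrained diffusion model.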
SDXL Turbo
In summary, SDXL Turbo is a model with the following characteristics:
- Generates 512x512 images from text prompts
- Number of NFEs: 1 to 4
- No negative prompts or CFG
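These characteristics map directly onto inference settings. The following is a minimal sketch using the Hugging Face diffusers library on a generic CUDA GPU; it is not the Qualcomm Cloud AI 100 deployment path used for the results in this post. The checkpoint name `stabilityai/sdxl-turbo` and the helper names `turbo_call_kwargs` and `generate` are assumptions introduced for illustration.

```python
from typing import Any, Dict

# Public Hugging Face checkpoint; an assumption, not named in this post.
MODEL_ID = "stabilityai/sdxl-turbo"


def turbo_call_kwargs(prompt: str, num_steps: int = 1) -> Dict[str, Any]:
    """Build pipeline arguments that match the characteristics listed above."""
    if not 1 <= num_steps <= 4:
        raise ValueError("SDXL Turbo is intended for 1 to 4 denoising steps")
    return {
        "prompt": prompt,
        "num_inference_steps": num_steps,
        "guidance_scale": 0.0,  # disables CFG, so each step costs one NFE
        "height": 512,
        "width": 512,
    }


def generate(prompt: str, num_steps: int = 1):
    """Run one text-to-image generation (downloads weights on first use)."""
    # Heavy imports stay local so the helper above works without them.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, variant="fp16"
    )
    pipe.to("cuda")  # assumes a CUDA-capable GPU is available
    return pipe(**turbo_call_kwargs(prompt, num_steps)).images[0]


# Example call (not executed here because it needs a GPU and the model weights):
# generate("An origami pig on fire in the middle of a dark room").save("pig.png")
```

No negative prompt is passed because, with guidance disabled, there is no second network call for it to act on.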
Sample Images Created with just 1 NFE
- “A dramatic shot of a classic detective in a trench coat and fedora, standing in a rain-soaked alleyway under a dim streetlight.”
- “The dreamlike digital art captures a vibrant, kaleidoscopic bird in a lush rainforest.”
- “An origami pig on fire in the middle of a dark room”
Experimental Results
In this section, we present the results of running SDXL Turbo on Qualcomm Cloud AI 100 Ultra.
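For scale, the headline figure quoted in the introduction (4 images for 4 prompts every ~250ms, at a TDP of 150W) can be turned into throughput and a rough energy bound. This is back-of-the-envelope arithmetic only: the latency is approximate, and the energy figure assumes the card draws its full TDP for the whole batch, so it is an upper bound rather than a measurement.

```python
images_per_batch = 4
batch_latency_s = 0.250  # "~250ms" from the introduction; approximate
tdp_watts = 150          # thermal design power, used as a worst-case draw

# Throughput and amortized time per image.
images_per_second = images_per_batch / batch_latency_s
amortized_ms_per_image = 1000 * batch_latency_s / images_per_batch

# Upper bound on energy per image if the card ran at full TDP throughout.
max_joules_per_image = tdp_watts * batch_latency_s / images_per_batch

print(f"{images_per_second:.1f} images/s")
print(f"{amortized_ms_per_image:.1f} ms per image (amortized)")
print(f"<= {max_joules_per_image:.3f} J per image")
```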
We keep the precision at FP16 to preserve fidelity. We also experiment with several numbers of inference steps to evaluate the performance thoroughly. In addition, we include an option to switch to the Tiny AutoEncoder for SDXL (TAESDXL), which is distilled from the original SDXL VAE.
This option significantly reduces the latency of the VAE module, at the cost of losing some fine details in the generated images.
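The VAE switch can be sketched as a one-line component swap. The version below uses the Hugging Face diffusers `AutoencoderTiny` class on a generic GPU setup and is not the Qualcomm Cloud AI 100 implementation; the checkpoint name `madebyollin/taesdxl` and the helper names are assumptions introduced here.

```python
from types import SimpleNamespace
from typing import Any, Callable

# Public TAESDXL checkpoint; an assumption, not named in this post.
TINY_VAE_ID = "madebyollin/taesdxl"


def load_tiny_vae() -> Any:
    """Load the tiny autoencoder in FP16 (downloads weights on first use)."""
    import torch
    from diffusers import AutoencoderTiny

    return AutoencoderTiny.from_pretrained(TINY_VAE_ID, torch_dtype=torch.float16)


def swap_vae(pipe: Any, use_tiny: bool, loader: Callable[[], Any] = load_tiny_vae) -> Any:
    """Replace pipe.vae with the tiny autoencoder when use_tiny is set."""
    if use_tiny:
        pipe.vae = loader()
        # With a real pipeline, move it to the target device after the swap,
        # e.g. pipe.to("cuda"), so the new VAE lands on the same device.
    return pipe


# Dry run with stand-ins, so no model download is needed:
demo = swap_vae(SimpleNamespace(vae="full VAE"), use_tiny=True, loader=lambda: "tiny VAE")
print(demo.vae)
```

Keeping the swap behind a flag makes it easy to compare latency and image detail with and without the tiny autoencoder on the same prompts.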
Next Steps
SDXL Turbo enables fast text-to-image generation with only a few sampling steps. Our experiments have showcased efficient, low-power image generation using SDXL Turbo running on Qualcomm Cloud AI accelerators.
This efficiency motivates further exploration, such as fast image generation at higher resolutions and with more advanced text-to-image models.
Get started today with the Playground for Qualcomm Cloud AI.
Connect with fellow developers, get the latest news, and receive prompt technical support by joining our Developer Discord.

