Developer Blog

Real-Time Text-to-Image Synthesis with Adversarial Diffusion Distillation (ADD) Models on Qualcomm Cloud AI 100

Written by

Natarajan Vaidhyanathan

Jan 24, 2025

Co-written with Bonan Zhang.

Have you heard about SDXL Turbo? It’s an adversarial diffusion distilled version of the SDXL text-to-image model that can generate images from textual prompts in real time. On Qualcomm Cloud AI 100 Ultra (TDP of 150W), we can generate 4 images for 4 different prompts every ~250ms. Read on to learn more about the method, and try it yourself in the Playground for Qualcomm Cloud AI.

Synthesizing an image from a given textual description is one of the most widely used applications of deep generative AI models. The mathematical frameworks behind these models are varied, and many of them are about a decade old: Variational Autoencoders (2013), Generative Adversarial Networks (GANs) (2014), and Normalizing Flows (2015). More recently, denoising diffusion probabilistic models (2015, 2020) and their generalizations such as flow matching (2022) have emerged as powerful alternatives that excel at creating high-quality images.

Horns of Trilemma

The phrase “Generative Learning Trilemma” was coined by Xiao et al. in 2022 to describe three requirements of generative models. None of the mathematical frameworks mentioned above satisfies all three.

The requirements are:

Ability to generate high-quality samples
Ability to produce diverse outputs representing the whole distribution (aka mode coverage)
Ability to generate the samples fast

Requirement	VAE/Normalizing Flows	GANs	Diffusion Models
High-Quality Samples	No	Yes	Yes
Diverse Outputs	Yes	No	Yes
Fast Sampling	Yes	Yes	No

As the above table shows, GANs can generate high-quality samples fast because sampling involves just 1 evaluation of the generator neural network, referred to as requiring 1 NFE (network function evaluation). Diffusion models can also generate high-quality images and even surpass GANs (Dhariwal, 2021), but they are much slower because sampling is iterative and requires hundreds of NFEs.

Diffusion models generate outputs by starting from random noise and applying a sequence of denoising steps that gradually replace randomness with structure. In conditional generation, such as generating an image that corresponds to a given text prompt, the classifier-free guidance (CFG) technique makes each denoising step invoke the denoising neural network twice: once with the prompt and once without it (or with a ‘negative prompt’). CFG offers a trade-off between diversity and sample quality through a single parameter called the guidance strength. It can also be used to steer the model away from certain concepts, which are specified via the ‘negative prompt’. So, with CFG, diffusion models require NFEs equal to 2 times the number of denoising steps.
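The NFE count under CFG can be illustrated with a minimal sketch. Everything here is a toy assumption made for illustration: `CountingDenoiser` stands in for the real denoising network, the update rule is not a real scheduler, and the guidance strength of 7.5 is an arbitrary placeholder. Only the structure (two network calls per step, combined through the guidance strength) reflects the description above.

```python
import numpy as np

class CountingDenoiser:
    """Toy stand-in for a denoising network that counts its invocations (NFEs)."""
    def __init__(self):
        self.calls = 0

    def __call__(self, x_t, t, cond):
        self.calls += 1
        # Hypothetical behaviour: the "predicted noise" is just the conditioning value.
        return np.full_like(x_t, cond)

def cfg_noise_estimate(denoiser, x_t, t, prompt_cond, negative_cond, guidance_strength):
    """Combine two network evaluations into one guided noise estimate."""
    eps_prompt = denoiser(x_t, t, prompt_cond)      # 1st NFE of this step
    eps_negative = denoiser(x_t, t, negative_cond)  # 2nd NFE of this step
    return eps_negative + guidance_strength * (eps_prompt - eps_negative)

def sample(denoiser, num_steps, guidance_strength=7.5, shape=(4, 4), seed=0):
    """Run a toy sampling loop; the update below is NOT a real scheduler."""
    x = np.random.default_rng(seed).standard_normal(shape)
    for t in reversed(range(num_steps)):
        eps = cfg_noise_estimate(denoiser, x, t, 1.0, 0.0, guidance_strength)
        x = x - eps / num_steps
    return x

denoiser = CountingDenoiser()
sample(denoiser, num_steps=50)
print(denoiser.calls)  # 100: two evaluations for each of the 50 steps
```

Setting the guidance strength to 1 reduces the estimate to the prompt-conditioned prediction alone, which is why a model that needs no guidance can skip the second evaluation.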

Towards Tackling the Trilemma

To tackle the trilemma, an active area of research aims to speed up the sampling process of diffusion models, which is their main deficiency.

Guidance distillation distills a denoising network that needs 2 NFEs per denoising step for CFG into another denoising network that needs only 1 NFE per step.
Sampling in diffusion models is equivalent to solving a (stochastic or ordinary) differential equation. Instead of using black-box solvers, crafting solvers that exploit the structure of the diffusion differential equations makes it possible to sample high-quality images in 15 denoising steps (DPM-Solver++). You can try this solver on AI 100 with the Stable Diffusion model here. The main selling point of this approach is that it is applied at inference time, so there is no need to retrain the model.
Another approach reduces the computational requirements of the denoising network by discovering a simpler network through neural architecture search (NAS) and fine-tuning its weights. The Deci Diffusion model was constructed this way, and it also reduces the number of denoising steps required. It is available for use on Qualcomm Cloud AI accelerators here.
Yet another approach, the focus of this post, is called Adversarial Diffusion Distillation (ADD). It is a new way of taking a powerful foundational diffusion model like SDXL and making it generate appealing, high-quality images conditioned on text with few NFEs. This results in very fast image generation.
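To compare these approaches on a common scale, the short sketch below counts NFEs for a few scenarios. The step counts of 15 and 1 to 4 come from the text above; the 50-step baseline and the idea of combining DPM-Solver++ with guidance distillation are assumptions chosen purely for illustration, and the helper name `nfes` is hypothetical.

```python
def nfes(num_steps, evals_per_step):
    """Network function evaluations = denoising steps x network calls per step."""
    return num_steps * evals_per_step

scenarios = {
    # Assumed baseline: 50 steps with CFG (2 calls per step).
    "baseline sampler + CFG (assumed 50 steps)": nfes(50, 2),
    # DPM-Solver++ at 15 steps, still using CFG.
    "DPM-Solver++ + CFG (15 steps)": nfes(15, 2),
    # Assumed combination: 15 steps with a guidance-distilled network (1 call per step).
    "DPM-Solver++ + guidance distillation (15 steps)": nfes(15, 1),
    # ADD-distilled model: no CFG, 1 to 4 steps.
    "ADD, 1 step": nfes(1, 1),
    "ADD, 4 steps": nfes(4, 1),
}

for name, count in scenarios.items():
    print(f"{name}: {count} NFEs")
```

The count ignores the cost of each individual evaluation, so it does not capture the benefit of a smaller network such as the NAS-derived one.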

Adversarial Diffusion Distillation

ADD is a training technique presented by Sauer et al. that was used to create the SDXL Turbo model by distilling SDXL, a large foundational diffusion model. SDXL Turbo generates 512x512 images from text prompts using 1 to 4 denoising steps without classifier-free guidance. The distinguishing feature of ADD is the loss function used in training, which consists of two components:

Score Distillation Sampling (SDS) Loss

Adversarial (AD) Loss

The SDS Loss was introduced in the pioneering work of Poole et al. for text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models. It enabled the transfer of knowledge embodied in pretrained foundational 2D models to downstream tasks. By itself, however, it is known to produce blurry, oversaturated images. By incorporating the AD Loss, the ADD approach is able to generate sharp, high-fidelity images in just a few denoising steps without introducing distracting artifacts. ADD is also shown to outperform other models such as LCM and single-step GANs.
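The way two such components might be combined into a single training objective for the distilled (student) model can be sketched as follows. This is a simplified illustration, not the formulation from Sauer et al.: the hinge-style adversarial term, the plain squared-error stand-in for the distillation term, the weight `lam`, and all function names are assumptions made for this sketch.

```python
import numpy as np

def adversarial_generator_loss(disc_logits):
    """Hinge-style generator term: lower when the discriminator scores samples as real."""
    return float(-np.mean(disc_logits))

def distillation_loss(student_img, teacher_img):
    """Squared-error stand-in for the distillation term (student vs. teacher target)."""
    return float(np.mean((student_img - teacher_img) ** 2))

def student_loss(disc_logits, student_img, teacher_img, lam=1.0):
    """Total student objective: adversarial term plus weighted distillation term.

    `lam` is an arbitrary placeholder weight, not a value taken from the paper.
    """
    return adversarial_generator_loss(disc_logits) + lam * distillation_loss(
        student_img, teacher_img
    )

# Toy inputs: discriminator scores for two samples and 2x2 "images".
logits = np.array([0.5, -0.5])
student = np.zeros((2, 2))
teacher = np.ones((2, 2))
print(student_loss(logits, student, teacher, lam=2.0))  # 2.0
```

In this picture the distillation term pulls the student toward what the teacher would produce, while the adversarial term penalizes outputs a discriminator can tell apart from real images, which is the role the text assigns to the AD Loss in countering blur.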

SDXL Turbo

In summary, SDXL Turbo is a model with the following characteristics:

Generates 512x512 images from text prompts.

Number of NFEs: 1 to 4.

No negative prompts or CFG.
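These characteristics map directly onto inference settings. The sketch below uses the generic Hugging Face diffusers pipeline with the publicly released `stabilityai/sdxl-turbo` checkpoint on a CUDA GPU; it is not the Qualcomm Cloud AI 100 deployment flow, which this post does not spell out. The helper names `turbo_call_kwargs` and `generate` are hypothetical.

```python
def turbo_call_kwargs(prompt, num_steps=1):
    """Build pipeline arguments that match the characteristics listed above."""
    if not 1 <= num_steps <= 4:
        raise ValueError("SDXL Turbo is described here as using 1 to 4 denoising steps")
    return {
        "prompt": prompt,
        "num_inference_steps": num_steps,
        "guidance_scale": 0.0,  # disables CFG, so no negative prompt is used
        "height": 512,
        "width": 512,
    }

def generate(prompt, num_steps=1, device="cuda"):
    """Generate one image; assumes torch and diffusers are installed and a GPU is present."""
    # Imported lazily so turbo_call_kwargs can be used without these packages.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to(device)
    return pipe(**turbo_call_kwargs(prompt, num_steps)).images[0]

print(turbo_call_kwargs("An origami pig on fire in the middle of a dark room"))
```

Calling `generate(...)` downloads the model weights on first use, so it is kept separate from the argument helper.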

Sample Images Created with just 1 NFE

“A dramatic shot of a classic detective in a trench coat and fedora, standing in a rain-soaked alleyway under a dim streetlight.”

“The dreamlike digital art captures a vibrant, kaleidoscopic bird in a lush rainforest.”

“An origami pig on fire in the middle of a dark room”

Experimental Results 

In this section, we present the results of running SDXL Turbo on Qualcomm Cloud AI 100 Ultra.

We keep the precision at FP16 for fidelity reasons. We also experiment with several inference step counts to evaluate the performance thoroughly. In addition, we include an option to switch to the Tiny AutoEncoder for SDXL (TAESDXL), which is distilled from the original SDXL VAE.

This option significantly reduces the latency of the VAE module, at the cost of some fine detail in the generated images.
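For a rough sense of scale, the headline figure quoted at the start of this post (4 images every ~250ms on a card with a 150W TDP) can be turned into throughput and an energy bound. This is back-of-the-envelope arithmetic only: TDP is an upper limit on power draw, not a measurement, so the energy figure is an upper bound under that assumption. The function names are hypothetical.

```python
def throughput_images_per_s(batch_size, latency_s):
    """Images generated per second for a batch completed every latency_s seconds."""
    return batch_size / latency_s

def energy_upper_bound_j(tdp_w, latency_s, batch_size):
    """Upper bound on energy per image, assuming the card draws its full TDP."""
    return tdp_w * latency_s / batch_size

print(throughput_images_per_s(4, 0.25))    # 16.0 images per second
print(energy_upper_bound_j(150, 0.25, 4))  # 9.375 joules per image at most
```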

Next Steps

SDXL Turbo enables fast text-to-image generation with only a few sampling steps. Our experiments show efficient, low-power image generation with SDXL Turbo running on Qualcomm Cloud AI accelerators.

This efficiency motivates further exploration, such as fast image generation at higher resolutions and with more advanced text-to-image models.

Get started today with the Playground for Qualcomm Cloud AI.

Connect with fellow developers, get the latest news and prompt technical support by joining our Developer Discord.