
Learn to optimize Stable Diffusion on Qualcomm Cloud AI 100


Dive in to learn how we reduce latency by 1.4x on Qualcomm Cloud AI 100 Ultra accelerators by applying an innovative DeepCache technique to text-to-image generation. What's more, Offline throughput improves by 3x over the baseline when DeepCache is combined with a novel MultiDevice Scheduling technique. Read on!


Background

Stable Diffusion XL (SDXL) is a text-to-image workload, which takes positive and negative text prompts as input and generates a 1024x1024 image corresponding to the prompts as output. Since 2024, this workload has been included in the industry-leading MLPerf Inference benchmark (sometimes called "The Olympics of ML benchmarking").

The overall computation pipeline for SDXL consists of four components: text encoder, text encoder 2, UNet, and VAE decoder. UNet is the most computationally intensive part (90-95% of execution time), requiring separate processing of positive and negative prompts, and performing 20 iterations.
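To make the cost structure concrete, here is a minimal counting sketch. It assumes that nothing is batched, so each prompt is encoded separately and each UNet iteration runs once per prompt; the real pipeline may batch these calls, and the function name is purely illustrative.

```python
# Illustrative only: component names follow the text above. The call counts
# assume no batching, which may not match the actual implementation.
UNET_ITERATIONS = 20      # from the text
PROMPTS_PER_SAMPLE = 2    # one positive and one negative prompt


def invocations_per_image(iterations=UNET_ITERATIONS, prompts=PROMPTS_PER_SAMPLE):
    """Return how many times each component runs to generate one image."""
    return {
        "text_encoder": prompts,
        "text_encoder_2": prompts,
        "unet": iterations * prompts,
        "vae_decoder": 1,
    }


counts = invocations_per_image()
print(counts["unet"])  # 40
```

Under these assumptions UNet runs 40 times per image while every other component runs once or twice, which is consistent with UNet dominating execution time.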

MultiDevice Scheduling

In our original implementation, each of the four components is compiled into a separate binary by the Qualcomm Cloud AI 100 SDK. The binaries are loaded onto a device before execution. Using a mechanism called model switching, the implementation switches between the component binaries when processing a single input sample. The switching is relatively fast (only 100-200 ms), but it does add overhead to the pipeline.

The impact of this switching overhead becomes particularly apparent when comparing the SingleStream and Offline scenarios of MLPerf Inference. In the SingleStream scenario, the system processes one sample at a time, requiring model switching between the components of a single request and between multiple requests. In the Offline scenario, the system processes a large batch of samples (e.g. 5,000 prompts), providing an opportunity to reduce the need for frequent model switching and enabling more efficient processing.
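The difference between the two scenarios can be illustrated by counting model switches for two hypothetical execution orders on a single device. This is a simplified sketch: the function names are ours, and a real system would likely process Offline samples in chunks rather than holding intermediate results for all of them.

```python
COMPONENTS = ["text_encoder", "text_encoder_2", "unet", "vae_decoder"]


def count_switches(order):
    """Count how often the loaded component changes in an execution order."""
    return sum(1 for prev, cur in zip(order, order[1:]) if prev != cur)


def sample_major(num_samples):
    """Run all four components for one sample before starting the next."""
    return [c for _ in range(num_samples) for c in COMPONENTS]


def stage_major(num_samples):
    """Run one component for every sample before moving to the next component."""
    return [c for c in COMPONENTS for _ in range(num_samples)]


print(count_switches(sample_major(5000)))  # 19999
print(count_switches(stage_major(5000)))   # 3
```

At 100-200 ms per switch, the sample-major order spends a noticeable share of each sample's time on switching, whereas grouping work by component makes the overhead negligible.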

We have also found that compiling UNet to use 4 cores and running 4 parallel activations (instances) improves computational efficiency, resulting in a 50% throughput boost for UNet.

To overcome these challenges, we have developed an advanced scheduling technique where the computation is distributed between several accelerator devices in the system. We dedicate a small number of “master” devices to text encoding and VAE decoding, and a larger number of “worker” devices to UNet. This simple adjustment has no effect on the output, but unlocks substantial performance improvements.

Let's break this down with some numbers. On a Pro card, UNet processing takes 20 times longer than VAE decoding. Given this disparity, we found that in a system equipped with 8 Pro cards (8 devices), a 1:7 ratio of master to worker devices yields optimal results. That is, one master device handles text encoding and VAE decoding, while seven worker devices focus on parallel UNet execution. This configuration minimizes switching overhead while maximizing computational efficiency.

The same principle applies to systems equipped with Ultra cards, where UNet processing takes 26 times longer than VAE decoding. On systems with 2 Ultra cards (8 devices), the same 1:7 ratio of master to worker devices proves most effective. On systems with 4 or more Ultra cards, a 1:13 ratio works best.
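As a rough sanity check of the 8-device ratios above, the sketch below uses a deliberately simple model: each master completes one sample per VAE-time unit, each worker completes one sample every `unet_to_vae_ratio` units, and the slower group limits throughput. It ignores text encoding time, switching overhead on the master, and parallel activations, so it is an assumption-laden illustration rather than the method used to choose the ratios.

```python
def best_split(num_devices, unet_to_vae_ratio):
    """Return (masters, workers, rate) maximizing a simple bottleneck model.

    The rate is measured in samples per VAE-decoding time unit.
    """
    best = None
    for masters in range(1, num_devices):
        workers = num_devices - masters
        # The pipeline runs at the pace of its slower group.
        rate = min(masters, workers / unet_to_vae_ratio)
        if best is None or rate > best[2]:
            best = (masters, workers, rate)
    return best


print(best_split(8, 20)[:2])  # (1, 7) for the Pro ratio
print(best_split(8, 26)[:2])  # (1, 7) for the Ultra ratio
```

In both cases the single master is far from saturated, so adding a second master would only take a device away from UNet, which is the bottleneck.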

By tailoring our pipeline scheduling to the specific characteristics of each system, we are able to extract maximum performance across different hardware configurations.

DeepCache

In a separate effort to optimize SingleStream latency, we explore DeepCache [1], a recent training-free technique that accelerates diffusion models by exploiting temporal redundancy in the sequential denoising process. DeepCache capitalizes on the high degree of similarity between the high-level features of adjacent denoising steps. By caching these slowly changing features and reusing them across steps, DeepCache eliminates redundant calculations. We have found that using one Deep iteration (a full UNet pass that refreshes the cache) followed by two Shallow iterations (which reuse the cached features) still allows us to meet the MLPerf Inference accuracy constraints for the Closed division.
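The one-Deep, two-Shallow pattern can be sketched as follows. This is a toy illustration, not the actual implementation: it assumes the pattern starts with a Deep iteration and repeats uniformly over 20 steps, and `toy_deep` and `toy_shallow` are hypothetical stand-ins for the full and partial UNet passes.

```python
def deepcache_schedule(num_steps=20, shallow_per_deep=2):
    """Label each denoising step as 'deep' or 'shallow'."""
    period = shallow_per_deep + 1
    return ["deep" if step % period == 0 else "shallow" for step in range(num_steps)]


def run_denoising(latent, schedule, deep_step, shallow_step):
    """Run the loop, refreshing the cache on deep steps and reusing it otherwise."""
    cache = None
    for kind in schedule:
        if kind == "deep":
            latent, cache = deep_step(latent)
        else:
            latent = shallow_step(latent, cache)
    return latent


calls = {"deep": 0, "shallow": 0}


def toy_deep(latent):
    calls["deep"] += 1
    features = latent * 0.5  # stand-in for the cached high-level features
    return latent - features * 0.1, features


def toy_shallow(latent, cached_features):
    calls["shallow"] += 1
    return latent - cached_features * 0.1


schedule = deepcache_schedule()
run_denoising(1.0, schedule, toy_deep, toy_shallow)
print(calls)  # {'deep': 7, 'shallow': 13}
```

Under this assumed schedule, only 7 of the 20 iterations pay the full UNet cost; the other 13 reuse cached features.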

Results

For evaluation, we use GIGABYTE R282-Z93 servers equipped with 2 Ultra cards and 8 Pro cards, running SDK v1.18.2.0. These are the same servers that were used for the official MLPerf Inference v4.0 submissions with SDK v1.12.2.0.

Offline

Table 1: SDXL Offline on 2x Qualcomm Cloud AI 100 Ultra cards (8 devices).

| Milestone | Samples per second | Speedup |
|---|---|---|
| Baseline | 0.356 | 1.00 |
| MultiDevice | 0.533 | 1.50 |
| DeepCache | 0.830 | 2.33 |
| MultiDevice + DeepCache | 1.077 | 3.03 |
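To put the Table 1 throughput figures in perspective, the snippet below converts samples per second into wall-clock time for a 5,000-prompt run, the example batch size mentioned earlier. Whether the measurements were taken on exactly that many samples is not stated, so treat the result as a back-of-the-envelope estimate.

```python
def offline_hours(num_samples, samples_per_second):
    """Wall-clock hours needed to process num_samples at a steady throughput."""
    return num_samples / samples_per_second / 3600


print(round(offline_hours(5000, 0.356), 2))  # 3.9  (Baseline)
print(round(offline_hours(5000, 1.077), 2))  # 1.29 (MultiDevice + DeepCache)
```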

 

Table 2: SDXL Offline on 8x Qualcomm Cloud AI 100 Pro cards (8 devices).

| Milestone | Samples per second | Speedup |
|---|---|---|
| Baseline | 0.608 | 1.00 |
| MultiDevice | 0.827 | 1.36 |
| DeepCache | 1.195 | 1.97 |
| MultiDevice + DeepCache | N/A | N/A |

SingleStream

Table 3: SDXL SingleStream on 1x Qualcomm Cloud AI 100 Ultra card (2 devices).

| Milestone | Milliseconds per sample | Speedup |
|---|---|---|
| Baseline | 11,669 | 1.00 |
| DeepCache | 8,335 | 1.40 |

Table 4: SDXL SingleStream on 2x Qualcomm Cloud AI 100 Pro cards (2 devices).

| Milestone | Milliseconds per sample | Speedup |
|---|---|---|
| Baseline | 7,429 | 1.00 |
| DeepCache | 4,633 | 1.60 |
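The Speedup columns in the tables above follow from the raw measurements. Note that the ratio is inverted between the two scenarios: Offline reports throughput (higher is better), while SingleStream reports latency (lower is better). The helper names below are ours.

```python
def throughput_speedup(baseline_sps, optimized_sps):
    """Speedup for throughput metrics, where higher is better."""
    return optimized_sps / baseline_sps


def latency_speedup(baseline_ms, optimized_ms):
    """Speedup for latency metrics, where lower is better."""
    return baseline_ms / optimized_ms


print(round(throughput_speedup(0.356, 1.077), 2))  # 3.03 (Table 1)
print(round(latency_speedup(11669, 8335), 2))      # 1.4  (Table 3)
print(round(latency_speedup(7429, 4633), 2))       # 1.6  (Table 4)
```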

 

Visualization Tool

To understand and optimize our system, we developed an interactive visualization tool. This web-based demo provides real-time insights into node utilization, task distribution, and performance metrics across a system equipped with Qualcomm Cloud AI 100 accelerators. Users can scrub through the execution timeline, observe how tasks are distributed across nodes, and drill into performance charts for individual accelerators. The tool not only showcases the efficiency of our distributed architecture but also aids further optimization and debugging.
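One metric such a tool can derive from an execution timeline is per-device utilization, meaning the fraction of a time window during which a device is busy. The sketch below assumes a hypothetical event format of `(device, start, end)` tuples with no overlapping tasks on the same device; it does not reflect the tool's actual data model.

```python
from collections import defaultdict


def utilization(events, window_start, window_end):
    """Return the busy fraction of each device within a time window.

    events: iterable of (device, start, end) tuples, assumed not to overlap
    on the same device.
    """
    busy = defaultdict(float)
    for device, start, end in events:
        # Only count the part of the task that falls inside the window.
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            busy[device] += overlap
    span = window_end - window_start
    return {device: total / span for device, total in busy.items()}


timeline = [
    ("master0", 0.0, 2.0),
    ("worker1", 0.0, 9.0),
    ("worker1", 9.5, 10.0),
]
print(utilization(timeline, 0.0, 10.0))
```

A lightly loaded master alongside nearly saturated workers is the pattern one would expect from the master/worker split described earlier.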


Next steps

Our work represents a step forward in the performance and efficiency of text-to-image generation on Qualcomm Cloud AI 100 accelerators.

Looking ahead, we are exploring even more advanced optimization techniques. We're developing an auto-scheduling system based on compiled model runtime information, leveraging greedy algorithms to dynamically allocate tasks for maximum efficiency. This approach promises to further push the boundaries of SDXL inference performance on Qualcomm Cloud AI 100 accelerators.
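As one example of what a greedy allocator could look like, the sketch below applies the classic longest-processing-time-first heuristic: tasks are sorted by measured runtime and each is assigned to the currently least-loaded device. This is a generic textbook technique shown for illustration, not the auto-scheduling system under development; the task names and runtimes are made up, and task dependencies are ignored.

```python
import heapq


def greedy_assign(task_runtimes, num_devices):
    """Assign tasks to devices greedily, longest task first.

    Returns (assignment, makespan), where assignment maps a device index to
    its list of task names and makespan is the largest total device load.
    """
    heap = [(0.0, device) for device in range(num_devices)]
    heapq.heapify(heap)
    assignment = {device: [] for device in range(num_devices)}
    ordered = sorted(task_runtimes.items(), key=lambda item: item[1], reverse=True)
    for name, runtime in ordered:
        load, device = heapq.heappop(heap)  # least-loaded device so far
        assignment[device].append(name)
        heapq.heappush(heap, (load + runtime, device))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


# Hypothetical runtimes in arbitrary units.
tasks = {"unet_a": 26, "unet_b": 26, "vae_a": 1, "vae_b": 1, "text_a": 1, "text_b": 1}
assignment, makespan = greedy_assign(tasks, num_devices=2)
print(makespan)  # 28.0
```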

Try out the Qualcomm AI Inference Suite including our SDXL text-to-image inferencing.

 

 

[1] Ma, Xinyin, Gongfan Fang, and Xinchao Wang. "DeepCache: Accelerating diffusion models for free." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.


About the Authors
Junfan Huang, Senior R&D Engineer, KRAI
Dr Anton Lokhmotov, Founder & CEO, KRAI
Natarajan Vaidhyanathan