Offload Tough Workloads to Adreno GPUs: Acceleration of the Adaptive Loop Filter for the H.266/VVC Software Decoder on Snapdragon Mobile Platforms
The latest state-of-the-art video compression standard, H.266, also known as Versatile Video Coding (VVC), is poised for widespread adoption across various applications and devices. It can enable users to efficiently stream high-quality videos, engage in video calls, and enjoy immersive content at a level of fidelity and efficiency unmatched by other codecs.
The standard’s initial dependence on software-based VVC decoders helps bridge the gap until hardware and dedicated accelerator-based decoders become available. However, achieving a high-quality and performant software decoder for mobile devices (e.g., rendering 4K content at 60fps on premium Android smartphones) poses significant challenges due to VVC’s enormous computational requirements.
The Qualcomm® Adreno™ GPU has evolved beyond mere graphics acceleration. Adreno GPUs lead the industry in performance and energy efficiency metrics, thanks to their advanced architecture, low clock rates, and sophisticated power management. They now serve as versatile compute engines, and enable a wide range of applications, including:
- Image/video processing
- Physics simulations
- Machine learning
- GenAI/large language model workloads
This blog introduces offloading the Adaptive Loop Filter (ALF), one of the most demanding modules in VVC, onto Adreno GPUs via OpenCL. With the help of Adreno, a software-based VVC decoder can achieve real-time 4K decoding performance at 60fps on the Snapdragon® 8 Gen 2 Mobile Platform.
Refer to this white paper for more in-depth technical insights.
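To put the 4K at 60fps target in perspective, the short Python sketch below works out the per-frame time budget and the number of samples that must pass through an in-loop filter each second. It is only back-of-the-envelope arithmetic: the 3840x2160 resolution and the 4:2:0 chroma format (1.5 samples per pixel) are assumptions, and `decode_budget` is a hypothetical helper name, not part of any decoder.

```python
def decode_budget(width=3840, height=2160, fps=60, samples_per_pixel=1.5):
    """Return (luma samples per frame, total samples per second, ms per frame).

    Assumptions: "4K" means 3840x2160, and the content is 4:2:0, so each
    pixel carries one luma sample plus half a sample of chroma on average.
    """
    luma_per_frame = width * height
    samples_per_frame = int(luma_per_frame * samples_per_pixel)
    samples_per_second = samples_per_frame * fps
    frame_budget_ms = 1000.0 / fps
    return luma_per_frame, samples_per_second, frame_budget_ms


if __name__ == "__main__":
    luma, per_second, budget_ms = decode_budget()
    print(f"luma samples per frame : {luma:,}")
    print(f"samples per second     : {per_second:,}")
    print(f"time budget per frame  : {budget_ms:.2f} ms")
```

Under these assumptions, every decoder stage, ALF included, has to fit inside a budget of roughly 16.7 ms per frame, which is why moving a heavy stage off the CPU can matter.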
Unleash the Parallel Computing Power of Adreno GPUs on Snapdragon Mobile Platforms with OpenCL
OpenCL, an industry-standard framework for parallel programming across heterogeneous computing platforms, can unlock the compute capability of heterogeneous mobile devices, including Snapdragon mobile platforms. OpenCL fully harnesses the power of Adreno GPUs, enabling efficient and high-performance computation across various mobile applications. Below are some notable aspects and facts around OpenCL and Adreno:
1. Adreno GPUs and OpenCL 3.0:
- Adreno GPUs fully support the latest OpenCL 3.0 API, including many optional features through Khronos’ KHR/EXT extensions.
- Qualcomm Technologies has actively participated in the Khronos OpenCL standardization process since its inception and remains committed to OpenCL.
2. Innovation and Vendor Extensions:
- We continue to innovate and contribute to OpenCL development.
- Besides supporting various KHR/EXT OpenCL extensions, Qualcomm Technologies has introduced powerful vendor extensions.
- Notably, features like the Recordable Command Queue and on-chip global memory significantly enhance processing efficiency and boost power savings for many workloads, such as frame-based video processing.
A key advantage of the Adreno GPU is its unified memory architecture, where the CPU and GPU share a single address space. This eliminates data transfer bottlenecks: compute tasks like the ALF can operate directly on shared memory, with no copies between CPU and GPU buffers. For more details, consult the OpenCL Programming guide and the Adreno OpenCL SDK examples.
Why Offload the ALF to the GPU?
GPUs excel in scenarios involving massive data input, high parallelism, and tolerance for latency. This includes machine learning tasks involving extensive convolutions, matrix multiplications, and signal processing (such as FFT, filtering, and denoising).
However, ALF is not inherently the best fit for GPU parallel processing. Its control flow diverges because of factors such as location classes, block categorizations, and diamond-shaped filters. Even so, ALF is one of the more intuitive blocks in VVC to offload to the GPU. This doesn’t preclude other blocks, such as intra-block decoding, from benefiting from GPU acceleration. In fact, multiple blocks have already been offloaded to the GPU within the decoder.
For deeper insights, refer to this white paper, which covers workload partitioning, overhead handling between CPU and GPU synchronization, and other critical design decisions.
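To make the divergence sources above more concrete, here is a minimal CPU-side Python sketch of a diamond-shaped filter whose coefficients are selected per block by a class index. It is an illustration only, not the normative VVC filter and not the GPU kernel used in the decoder: it omits VVC's clipping, geometric transforms, and coefficient symmetry, and the 4x4 block size, 7-bit coefficient precision, edge clamping, and all function names are assumptions made for the example.

```python
def diamond_offsets(radius):
    """(dy, dx) taps with |dy| + |dx| <= radius; radius 3 spans a 7x7 diamond."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if abs(dy) + abs(dx) <= radius]


def identity_coeffs(radius=3, shift=7):
    """Coefficient set that passes samples through unchanged."""
    return [(1 << shift) if tap == (0, 0) else 0
            for tap in diamond_offsets(radius)]


def smoothing_coeffs(radius=3, shift=7, side=2):
    """Weight `side` on every neighbour; the centre takes the remainder."""
    taps = diamond_offsets(radius)
    centre = (1 << shift) - side * (len(taps) - 1)
    return [centre if tap == (0, 0) else side for tap in taps]


def filter_plane(plane, class_map, coeff_sets, radius=3, block=4, shift=7):
    """Filter a 2D list of samples; each block x block tile picks its own
    coefficient set through class_map (hypothetical layout: one class index
    per tile). Neighbours outside the plane are clamped to the edge."""
    height, width = len(plane), len(plane[0])
    taps = diamond_offsets(radius)
    rounding = 1 << (shift - 1)
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Per-block class lookup: neighbouring tiles may use different
            # coefficients, one source of divergence on a GPU.
            coeffs = coeff_sets[class_map[y // block][x // block]]
            acc = 0
            for (dy, dx), c in zip(taps, coeffs):
                yy = min(max(y + dy, 0), height - 1)
                xx = min(max(x + dx, 0), width - 1)
                acc += c * plane[yy][xx]
            out[y][x] = (acc + rounding) >> shift
    return out
```

A caller would build `class_map` with one entry per tile, for example `[[0, 1], [1, 0]]` for an 8x8 plane, and pass `[identity_coeffs(), smoothing_coeffs()]` as `coeff_sets`. On a GPU, each work-item would perform the inner tap loop for one sample (or a small group of samples), so the per-tile coefficient lookup and the non-rectangular tap pattern are where work-items can stop executing in lockstep.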
How to Squeeze the Best Performance out of Adreno GPUs
OpenCL code is generally portable across different vendors and generations of GPU with consistent versions and features. However, its performance is implementation-dependent, as it relies heavily on the underlying architecture and hardware.
This white paper focuses on ALF optimizations to achieve peak performance with the Adreno GPU in the Snapdragon 8 Gen 2 Mobile Platform. It takes a holistic view, covering the algorithmic perspective, the data flow, and the precision sensitivity at each stage, and serves as a valuable illustration of how to offload and optimize complex OpenCL use cases.
Summary
The success of offloading ALF onto Adreno demonstrates the application of the Adreno GPU for general purpose use cases, in addition to traditional graphics rendering. To learn more about OpenCL optimization for Adreno, refer to the OpenCL programming guide and best practices on Adreno GPUs.
Acknowledgement
- This achievement is the result of a collaborative effort between the Qualcomm Technologies GPU team and Tencent’s VVC team.
- Qualcomm GPU Research Team: Jeng-Hau Lin
- Tencent Media Lab: Xuerui Ma, Yu Guo, Tong Ouyang, Yuchen Li, Yiming Li

