OnQ Blog

Adreno Graphics Tools -- Make Better Apps On Snapdragon and the Adreno GPU

12 Sep 2011

Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.

Developing and optimizing 3D games and apps for devices with Snapdragon and its integrated Adreno GPU (Graphics Processing Unit) is easy with Qualcomm's advanced Adreno mobile development tools. Developers can save time and money by using the Adreno tools to optimize performance and speed up development.


Manish Sirdeshmukh

Senior Product Manager


Related News

OnQ

Heterogeneous Computing: An architecture and a technique

If you’re looking to create great mobile experiences, optimization isn’t optional: it’s a crucial step that helps transform good ideas into great execution. In our previous “Start Cooking with Heterogeneous Computing Tools on QDN” blog, we discussed the concept of heterogeneous computing and how it can help you get more from mobile hardware by sending computational tasks to the best suited processor. Heterogeneous computing is designed to help you achieve better application performance while improving thermal and power efficiency.

However, not all systems capable of heterogeneous computing are created equal, and it’s important to understand why. Heterogeneous computing is both a computational technique and a hardware architecture. To achieve the greatest benefits, you are better served by hardware architected for heterogeneous computing from the ground up, along with a software stack that facilitates heterogeneous computing techniques. It’s the combination of purpose-built hardware and a software stack offering granular control within a larger framework of system abstraction that allows for the deep optimizations that heterogeneous computing can deliver.

The Qualcomm Snapdragon Mobile Platform is designed on these principles. This starts with the microarchitecture – the choices made in platform circuitry that include how individual processors are engineered for high performance and how to optimize compute paths between the processors. Let’s look at the main components of the Snapdragon mobile platform and a few of the microarchitecture considerations that went into its design:

Qualcomm Kryo 280 CPU

Designed to handle complex workloads like web browsing and in-game artificial intelligence, the Kryo 280 is an octa-core CPU with independent high-efficiency and high-performance core clusters. During normal operation, the high-efficiency cores run most tasks, while the high-performance cores activate for anything needing more power.

Qualcomm Hexagon 682 DSP

With the Hexagon Vector eXtensions (HVX), the Hexagon DSP excels at applications requiring heavy vector data processing, such as 6-DOF (six degrees of freedom) head motion tracking for virtual reality, image processing, and neural network computations. With 1024-bit wide vector processing and dual execution of control code and computational code within the DSP, Hexagon can achieve breakthrough performance without draining system power.

Qualcomm Adreno 540 GPU

Ideal for arithmetic-heavy workloads that require substantial, parallel number crunching like 3D graphics rendering and camcorder image stabilization, the Adreno GPU is engineered to achieve improved power efficiency and 40% better performance than predecessors. Designed to deliver up to 25% faster graphics rendering and 60x more display colors compared to previous designs, the Adreno GPU supports real-life-quality visuals, and can perform stunning visual display feats like stitching together 4K 360 video in real time.

Heterogeneous computing in microarchitecture design

Beyond the performance enhancements among the individual processors, the Snapdragon mobile platform was designed to optimize the use of the processors together. For example, the Hexagon DSP can bypass DDR memory and the associated data shuffling CPU cycles by streaming data directly from sensors to the DSP cache. Similarly, the Adreno GPU supports 64-bit virtual addressing, allowing for shared virtual memory (SVM) and efficient co-processing with the Kryo CPU. These are just two of the microarchitecture design choices in the Snapdragon mobile platform that make it cutting-edge for heterogeneous computing.
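For a taste of what SVM co-processing looks like from the software side, here is a minimal host-side sketch using the standard OpenCL 2.0 shared virtual memory API. It is an illustration only, not Snapdragon-specific code: the context, queue and kernel are assumed to exist, the buffer size is hypothetical, and error checking is omitted.

    #include <CL/cl.h>

    /* Minimal SVM sketch: the CPU and GPU touch the same allocation
     * without an explicit copy. Assumes an OpenCL 2.0 context/queue
     * and a hypothetical kernel taking one SVM pointer argument. */
    void svm_demo(cl_context ctx, cl_command_queue q, cl_kernel kernel)
    {
        const size_t n = 1024;
        float *data = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                          n * sizeof(float), 0);

        /* CPU writes directly into the shared allocation. */
        clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, data,
                        n * sizeof(float), 0, NULL, NULL);
        for (size_t i = 0; i < n; ++i)
            data[i] = (float)i;
        clEnqueueSVMUnmap(q, data, 0, NULL, NULL);

        /* GPU consumes the same pointer; no staging copy needed. */
        clSetKernelArgSVMPointer(kernel, 0, data);
        size_t gws = n;
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL,
                               0, NULL, NULL);
        clFinish(q);
        clSVMFree(ctx, data);
    }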

Software

As we noted at the beginning of this post, heterogeneous computing is also a technique. Truly supporting it requires a software stack that gives developers both the abstractions and the control to leverage the optimizations in the hardware according to the requirements of their application.

To program the DSP or the GPU for heterogeneous computation, and to maximize their performance, developers can use the Qualcomm Hexagon SDK and the Qualcomm Adreno SDK, respectively. These SDKs open a toolbox of controls allowing for precision manipulation of data and computational resources.
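As a small example of the entry point to GPU compute, this standard OpenCL host snippet locates the GPU device exposed through the Adreno SDK's OpenCL support. The printed name varies by device and driver; error checks are omitted for brevity.

    #include <CL/cl.h>
    #include <stdio.h>

    /* Enumerate the first OpenCL platform and its GPU device.
     * On Snapdragon this is the Adreno GPU. */
    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        char name[256];

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("GPU device: %s\n", name);
        return 0;
    }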

For system-wide heterogeneous computing control, Qualcomm Symphony system manager SDK provides the software utilities designed to achieve better performance and lower power consumption from the Snapdragon mobile platform. Symphony is designed to manage the entire platform in different configurations so that the most efficient and effective combination of processors and specialized cores are chosen to get the job done as quickly as possible, with minimal power consumption.

Developers can build their applications directly on top of these SDKs, and many opt for this route. However, there is also a growing ecosystem of SDKs, frameworks and supporting libraries that accelerate development within a given application domain. Two examples are QDN's Adreno SDK for Vulkan for the Vulkan graphics API and our recently released Snapdragon VR SDK.

How to Put Heterogeneous Computing Techniques into Practice with Tools from Qualcomm Developer Network

The Snapdragon mobile platform’s microarchitecture, software development kits and high-level frameworks layer upon each other to provide developers ultimate control of their application performance and power usage. This combination of hardware purposely designed for heterogeneous computing and software allowing the use of heterogeneous computing techniques is at the heart of delivering great user experiences.

Whether you’re a developer who has always written code that executes serially and are interested in pushing more work to other processors, or you already have heterogeneous computing know-how and want to use it more effectively – perhaps with precise control over the dynamic distribution of your workloads – visit QDN for the resources you need to level up your heterogeneous computing skills.


23 Mar 2017
OnQ

Profiling VR apps for better performance

High performance in virtual reality apps is important to users, but we developers need to be smart about how we deliver it.

Does it surprise you to learn that 96% of mobile users consider performance important? If it’s that important to users, then it should be that important to us. Even if performance isn’t our highest priority, as developers we have a responsibility to make sure we’re not running pesky background services and draining the battery needlessly. And as VR hardware becomes more accessible, our apps shouldn’t cause a headset to overheat.

Qualcomm® Snapdragon™ Profiler allows you to analyze CPU, GPU, DSP, memory, power, thermal and network data so you can find and fix the performance bottlenecks in your apps that waste power and generate heat. Along with tools in our Symphony™ SDK and Adreno™ SDK, Snapdragon Profiler is a step on the path to heterogeneous programming and a valuable tool in building VR applications. It shows you how your code affects different cores and resources on the Snapdragon processor.

In this post I’ll show you three ways you can use Snapdragon Profiler to examine what’s happening inside your app. In my next post I’ll describe three worst practices of VR programming and show you how to profile them.

When your app is too hot

First, here’s an example, brought to you by mobile devices that get uncomfortably warm.

We’ve all read and seen what happens when a phone runs too hot. Temperature matters, whether it accumulates in the battery, the memory or the computing cores. That’s why Snapdragon Profiler shows you how your app affects temperature in the test device and how temperature affects compute power.

The screenshot below comes from Snapdragon Profiler running on my desktop while connected via USB to a very CPU-intensive app running on my test device.

The operating temperature of the CPU cores could stay at 150°F/65°C for only so long before frequency collapse, when the system scheduler and power monitors told the cores to run at a lower clock frequency so the device could cool off. You can see that CPU 0/1/2/3 Frequency and GPU % Utilization fell off like a rock for about one minute, then gradually inched back up.

If that happened in your VR app, how do you think it would affect the user experience? Performance would fall off like a rock, too. Your app would struggle to render frames as it did before, but now at a lower clock frequency and a lower frame rate. The hardware wouldn’t be able to keep up. Your users would one-star you to death.
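If you want a quick sanity check on throttling outside the profiler, you can poll the kernel's cpufreq interface while your app runs. Here's a rough sketch, assuming a Linux/Android device whose sysfs nodes are readable from your test harness:

    #include <stdio.h>
    #include <unistd.h>

    /* Poll cpu0's current clock once per second; a sustained drop
     * while your app runs is the frequency collapse described above.
     * Standard Linux cpufreq path; readability varies by device. */
    int main(void)
    {
        const char *node =
            "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
        for (;;) {
            FILE *f = fopen(node, "r");
            if (!f)
                return 1;
            long khz = 0;
            fscanf(f, "%ld", &khz);
            fclose(f);
            printf("cpu0: %ld kHz\n", khz);
            sleep(1);
        }
    }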

Three modes for profiling Snapdragon

That’s the kind of information Snapdragon Profiler gives you about your app. It offers three different modes for exploring app performance.

  • Real time – The screenshot above shows profiling in Snapdragon Profiler’s real-time mode, with my development machine connected to the test device via ADB over USB (or Wi-Fi). You can select running apps, services and widgets and see how they affect CPU, GPU, memory, network and thermal profile.
  • Trace capture – To display kernel and system performance over time, trace mode captures events at a high sample rate. In the screenshot below, Snapdragon Profiler shows DSP metrics and driver activity as surfaces are rendered in a VR app.
  • Snapshot – Analyze textures at the frame and buffer level in snapshot mode. You can see how your scene is being constructed with OpenGL ES: draw calls, object attributes, shader code and pixel history. In the partial snapshot below I’ve indicated the draw calls for right eye and left eye in a VR app.

Snapdragon Profiler works with all devices powered by Snapdragon processors. That includes commercial and development boards such as the DragonBoard 410c. (Expect limited functionality on non-Snapdragon devices.) The more recent the device and version of Android, the more valuable the information you can get from Snapdragon Profiler.

Next steps

Snapdragon Profiler is available now, waiting for you to download and install.

  • Check system requirements. We’ve made them pretty undemanding so you can start profiling right away.
  • Our FAQ covers installation, capabilities and troubleshooting.
  • We’ve put together a 6-minute video about using the profiler, featuring yours truly.

In the world of VR apps, Snapdragon Profiler can help you make the most of the 16ms you have (well, really about 12-14ms) for rendering each frame for both eyes. I’ll cover that in my next post.

And if you plan to attend AnDevCon, stop by my session called Profiling VR Games and Applications for Optimum Performance. Bring plenty of questions.


15 Dec 2016
OnQ

Matrix multiply on Adreno GPUs – Part 1: OpenCL optimization

The matrix multiply (MM) operation has become very popular on GPUs thanks to recent interest in deep learning, which depends heavily on convolutions. We’ve been hearing from developers who want to accelerate deep learning (DL) applications on Qualcomm Snapdragon processors with Adreno GPUs.

This is a two-part guest post by Vladislav Shimanskiy, one of our Adreno engineers. The concepts in this post and the OpenCL code listings you’ll see in his next post represent an optimized implementation of device-side matrix multiply kernels and host-side reference code for the Adreno 4xx and 5xx GPU families. We hope this series will help and encourage you to write your own OpenCL code using these ideas and code samples.

Vlad Shimanskiy is a senior staff engineer on the GPU Compute Solutions team at Qualcomm. He has worked on developing and prototyping new OpenCL 2.x standard features on Snapdragon, improving the Adreno GPU architecture for compute, and accelerating important linear algebra algorithms, including matrix multiplication on the GPU.

Parallel computing processors like the Adreno GPU are ideal for accelerating linear algebra operations. However, the MM algorithm is unique among intensively parallel problems in that it requires a great deal of data sharing between individual computing work-items. In the matrices to be multiplied — say, A and B — each element contributes many times to different components of resulting matrix C. So optimizing an MM algorithm for Adreno requires that we take advantage of the GPU memory subsystem.

What’s difficult about matrix multiplication on the GPU?

When we attempt to accelerate MM on the GPU, the data sharing problem noted above breaks down into several related problems:

  • MM operates on the same values repeatedly, but the larger the matrix, the more likely that we must go out to (slow) memory for values that we've had to replace in the cache, which is inefficient.
  • In a naïve implementation of MM, it would be natural to map scalar matrix elements to separate work-items. However, reading and writing scalars is inefficient because the memory subsystem and Arithmetic Logic Units (ALUs) on the GPU are optimized for vector operations.
  • Loading elements of large matrices A and B at the same time contributes to the risk of conflicts in caches and contention in memory buses.
  • Memory copying is slow, so we need a better way to make data visible to both CPU and GPU.

These problems complicate the main tasks of MM: reading the same values many times and sharing data.
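For a concrete baseline, here is the naïve scalar implementation that these problems afflict, written as plain C over row-major matrices. It is a reference sketch, not the optimized kernel:

    /* Naive matrix multiply: C = A * B, with A (m x k), B (k x n),
     * C (m x n), all row-major. Each element of A is re-read n times
     * and each element of B m times, so cache misses dominate as the
     * matrices grow. */
    void sgemm_naive(const float *A, const float *B, float *C,
                     int m, int n, int k)
    {
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += A[i * k + p] * B[p * n + j];
                C[i * n + j] = acc;
            }
        }
    }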

OpenCL optimization techniques for matrix multiplication

We’ve specified an OpenCL implementation that includes techniques to address each of the problems.

1. Tiling

The first well-known problem is to minimize repetitive reading of the same matrix elements from slow memories, such as higher-level caches and DDR. We have to try to group memory accesses (reads and writes) so that they are close to one another in the address space.

Our technique for improving data re-use is to split input and output matrices into sub-matrices called tiles. We then enforce the order of memory operations so that the dot products resulting from matrix multiplication are partially completed in the entire tile before we move reading pointers outside of the tile boundaries.

Our algorithm recognizes two levels of tiling: micro-tiles and macro-tiles. The following figure represents how we multiply components in matrix A by components in matrix B to arrive at a single dot product in matrix C:

The micro-tile — here, {dx, dy} — is a rectangular area inside a matrix that is processed by a single work-item in the kernel. Each work-item is a single thread within a SIMD sub-group that in turn forms an OpenCL work-group. A typical micro-tile has 4 x 8 = 32 components called pixels.

The macro-tile — here, {wg_size_x, wg_size_y} — is generally a bigger, rectangular area consisting of one or more micro-tiles and corresponding to a work-group. Within the work-group, we operate entirely within the macro-tile.

To calculate the 4x8 micro-tile in matrix C, we focus on areas in matrices A and B that have sizes 4x8 and 4x4 respectively. We start at pos = 0, calculate a partial result, or dot product, and store it in a temporary buffer for that micro-tile. Meanwhile, the other work-items in the same macro-tile calculate their partial results in parallel, using the same data loaded from matrix A or matrix B. All data in the row coming from matrix A is shared between work-items in the same row; likewise, all data in the column coming from matrix B is shared between the work-items in the same column.

We calculate partial results for all the micro-tiles in the macro-tile, then we increment pos horizontally in A and vertically in B at the same time. By doing tile-focused calculations and keeping the pos incrementing gradually, we can maximize reuse of data already in the caches. The micro-tile continues to accumulate, or convolve, partial results that add up to the dot product.

So we limit ourselves to all the positions inside the macro-tile and finish all the partial calculations before we move position. We could complete an entire micro-tile by sweeping pos from left to right and top to bottom and then advancing, but that’s inefficient, because we would need the same data again after the cache had evicted it. The point is that we’re working in an area limited by the work-group, with a number of work-items progressing concurrently. This approach guarantees that all memory requests from the parallel work-items are issued within a bounded address area.

Tiling optimizes the operation by focusing on a particular area in memory — a work-group — so that we can work in a cache-friendly manner. That is vastly more efficient than jumping across large swaths of memory and having to go out to DDR for values that are no longer in the cache.
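To make the structure concrete, here is a heavily simplified OpenCL kernel sketch of the micro-tile idea: each work-item accumulates an 8x4 micro-tile of C in registers while pos sweeps across A and down B. The dimensions and memory layout are illustrative assumptions, not the production kernel from part 2:

    /* Simplified tiling sketch: each work-item computes an 8-row x
     * 4-column micro-tile of C. A is MxK, B is KxN, C is MxN, all
     * row-major floats viewed as float4; assumes K and N are multiples
     * of 4 and M a multiple of 8. Launch with global size (N/4, M/8). */
    __kernel void sgemm_microtile(__global const float4 *A,
                                  __global const float4 *B,
                                  __global float4 *C,
                                  int K, int N)
    {
        int tx = get_global_id(0);   /* micro-tile column index */
        int ty = get_global_id(1);   /* micro-tile row index    */
        float4 acc[8];
        for (int i = 0; i < 8; ++i)
            acc[i] = (float4)(0.0f);

        /* pos advances 4 scalars at a time across A and down B;
         * partial dot products accumulate in registers. */
        for (int pos = 0; pos < K / 4; ++pos) {
            /* 4x4 block of B: rows 4*pos..4*pos+3, cols 4*tx..4*tx+3 */
            float4 b0 = B[(4 * pos + 0) * (N / 4) + tx];
            float4 b1 = B[(4 * pos + 1) * (N / 4) + tx];
            float4 b2 = B[(4 * pos + 2) * (N / 4) + tx];
            float4 b3 = B[(4 * pos + 3) * (N / 4) + tx];
            for (int i = 0; i < 8; ++i) {
                /* one float4 of row (8*ty + i) of A */
                float4 a = A[(8 * ty + i) * (K / 4) + pos];
                acc[i] += a.x * b0 + a.y * b1 + a.z * b2 + a.w * b3;
            }
        }
        for (int i = 0; i < 8; ++i)
            C[(8 * ty + i) * (N / 4) + tx] = acc[i];
    }

The work-group size then determines the macro-tile: for example, an 8x4 work-group of these work-items covers a 32x32 macro-tile of C, and all of its reads stay within the corresponding bands of A and B.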

2. Vectorization

Since the memory subsystem is optimized in hardware for vector operations, it is better to operate with vectors of data than with scalars, and to have each work-item deal with a micro-tile and a full vector. As a result, we use all values obtained in each vector-read operation.

For example, in the case of 32-bit floating point matrices, our kernels are written to use vectors of float4 type instead of just float. That way, if we want to read something from the matrix, we read not just a single floating point component of the matrix, but an entire block of data. That’s important because it matches how buses are designed, so we read components from the matrix in chunks of 4 elements and saturate the bandwidth to memory. Correspondingly, micro-tile dimensions are all multiples of 4.

If we were working on the CPU, we would likely read a 2-D array one scalar element at a time, but OpenCL on the GPU offers a better way. To make reads and writes efficient, we operate on variables of data type float4, or multiples of float4.
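The contrast looks like the following OpenCL C sketch; the kernel name and buffers are hypothetical, and the input is assumed to be 4-float aligned:

    /* One vectorized read replaces four scalar reads and fills the
     * 128-bit path to memory in a single transaction. */
    __kernel void vector_read_demo(__global const float *in,
                                   __global float4 *out)
    {
        size_t i = get_global_id(0);

        /* A scalar version would issue four separate loads:
         *   float s0 = in[4*i], s1 = in[4*i+1], ...          */

        /* Vector version: vload4 reads in[4*i .. 4*i+3] at once. */
        out[i] = vload4(i, in);
    }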

3. Texture Pipe

Using independent caching (L2 direct and Texture Pipe/L1) for two matrices, as shown in the figure below, allows us to avoid most of the contention and parallelize read operations so that the data from matrix A and matrix B gets loaded at the same time. Involving L1 helps to substantially reduce read traffic to L2.

In Adreno as in many other GPUs, each compute unit has independent connections to a texture pipe (TP) unit. The TP has its own L1 cache and an independent connection to the L2 cache.

The trick we use to increase the bandwidth is to load one matrix through TP and the other through the direct load/store pipe. Because we’re reusing components so much in matrix multiplication, we get the advantage of the L1 cache as well. We end up having much higher traffic from TP/L1 to the compute unit than from L2 to L1. That block reduces the traffic significantly. If, instead of the TP, we had just another connection to L2, it wouldn't help much because we’d have lots of contention and arbitration between the two buses.

The result is a great deal of traffic on the direct connection and very little from TP/L1 to L2. That helps us to increase total memory bandwidth, balance the ALU operations and achieve much higher performance. We spend nearly as much time on the ALU operation as we spend waiting for data to come back from caches, and we can pipeline them so that neither becomes a bottleneck.
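In OpenCL terms, routing a matrix through the TP means binding it as an image object rather than a buffer, so its reads flow through L1. Here is a hedged sketch of the read side; the texel packing (four floats per CL_RGBA/CL_FLOAT texel) and the kernel shape are illustrative, not the production kernel:

    /* Sketch: matrix B, bound as a read-only image, travels through
     * the texture pipe and L1; matrix A stays on the direct
     * load/store path to L2. width is the row width in float4 units. */
    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE |
                               CLK_FILTER_NEAREST;

    __kernel void dual_path_read(__global const float4 *A,
                                 __read_only image2d_t B,
                                 __global float4 *out,
                                 int width)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);

        float4 a = A[y * width + x];                  /* direct L2 path */
        float4 b = read_imagef(B, smp, (int2)(x, y)); /* TP / L1 path   */
        out[y * width + x] = a * b;                   /* placeholder math */
    }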

4. Memory Copy Prevention

Our OpenCL implementation has two components: the kernel running on the GPU, and the host code running on the CPU and controlling execution of the kernel. If we implement a GPU-accelerated library such as BLAS to multiply matrices, then the input matrices will be in CPU virtual memory space and the result of the multiplication has to be available in CPU memory as well. To accelerate the matrix multiply on GPU, the matrices must first be transferred to GPU memory.

The traditional way of doing this is to copy the matrices into the GPU address space, let the GPU perform its calculations, then copy the results back to the CPU. However, the time it takes to copy big matrices can be a significant fraction of the total time to compute on the GPU, so we want to avoid the inefficiency of memory copying by the CPU. Here, the Adreno GPU has the advantage of shared memory hardware on the Snapdragon processor that we can use instead of copying memory explicitly.

Why not simply allocate memory that is automatically shared between CPU and GPU? Unfortunately, it doesn’t work that way, because we need to address constraints like alignment. Only if the allocation is done properly with OpenCL driver routines can shared memory be used.
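A common zero-copy pattern is to let the OpenCL driver allocate the storage and then map it for CPU access, rather than copying into a separately allocated buffer. A hedged host-side sketch, with error checks omitted; CL_MEM_ALLOC_HOST_PTR is the standard flag that asks the driver to place the buffer in memory both sides can reach:

    #include <CL/cl.h>
    #include <string.h>

    /* Sketch: driver-allocated storage satisfies the alignment
     * constraints; mapping gives the CPU a pointer into the same
     * physical pages instead of a copy. */
    cl_mem make_shared_matrix(cl_context ctx, cl_command_queue q,
                              const float *src, size_t bytes)
    {
        cl_mem buf = clCreateBuffer(ctx,
                                    CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                    bytes, NULL, NULL);

        void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                     0, bytes, 0, NULL, NULL, NULL);
        /* In a real app you would generate data directly in p; we
         * copy from src here only to keep the sketch self-contained. */
        memcpy(p, src, bytes);
        clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
        return buf;
    }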

Results

The graph below depicts the performance increase across Adreno versions for Single-Precision General Matrix Multiply (SGEMM):

The graph is based on data from common floating point operations. Other MM kernels using different data types (8-bit, 16-bit, fixed point, etc.) can be efficiently implemented following the same principles we apply for SGEMM.

In general, our implementation of MM optimized for the Adreno GPU is at least two orders of magnitude faster than the naïve implementation.

What’s next?

In my next post I’ll give you the OpenCL code listings behind these concepts.

Matrix multiplication is an important basic linear algebra operation in convolutional neural networks. In particular, the performance of DL algorithms is tied to MM because all variations of convolution in DL can be reduced to multiplying matrices.

The concepts described above and the code you’ll see in the next post are not the only way to do convolutions. But the fact is that many popular DL frameworks like Caffe, Theano and Google’s TensorFlow often reduce convolution operations to MM, so it’s a good idea to follow the momentum in that direction. Stay tuned for code samples in part 2.
