Jan 19, 2021
Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.
The saying goes that a picture is worth a thousand words, so what does that imply for video? Video, which is essentially a sequence of static pictures, adds a temporal element and more context. Video perception, which means using AI to analyze and understand video content, can provide valuable insights and capabilities for many applications, ranging from autonomous driving and smart cameras to smartphones and extended reality. For example, autonomous driving uses video from multiple cameras for a variety of crucial tasks, including pedestrian, lane, and vehicle detection. Video perception is crucial for understanding the world and making devices smarter.
When talking about AI, people often ask if there’s enough data. For video, the answer is a definitive yes. Video data is abundant and being generated at ever-increasing rates. In fact, video is all around us, providing entertainment, enhancing collaboration, and transforming industries. The scale of video being created and consumed is massive: consider that close to 1 million minutes of video cross the internet per second. However, only a tiny portion of this vast amount of video data, a drop in the ocean, is annotated for supervised learning. This motivates solutions that leverage unsupervised and semi-supervised learning.
Compute efficiency is essential for ubiquitous video perception
So, if video data is readily available and video analysis provides valuable information, why isn’t AI being used more often for video perception? In my webinar, I’ll go into a number of data and implementation challenges, but the problem area that I want to focus on for this blog post is computational efficiency. As video resolution and frame rates increase, and as AI video perception models become more complex to improve accuracy, running these workloads in real time is becoming more challenging. Adding to the challenge, we want to run video perception on a diverse set of devices that often have power, thermal, compute, and memory constraints. Processing data closer to the source through on-device AI is important since it offers crucial benefits such as privacy, personalization, and reliability, in addition to helping scale intelligence.
Efficiently running on-device video perception without sacrificing accuracy
At Qualcomm AI Research, our research goal for video perception is to achieve efficient solutions while maintaining and improving neural network model accuracy. Rather than doing brute force calculations, we try to remove computations that are unnecessary, meaning those that can be dropped without degrading accuracy. Removing unnecessary calculations generally improves performance, lowers memory consumption, and saves power. The efficient video perception techniques that we’ve developed are centered around two key concepts: leveraging temporal redundancy and making early decisions.
Leveraging temporal redundancy to reduce computations across frames
Leveraging temporal redundancy means taking advantage of the fact that video frames are heavily correlated. The difference between two consecutive video frames is often minimal and contains little new information in most regions, so it is often unnecessary to analyze the entire image. We want to limit the computation to the regions where there are significant changes. Learning to skip regions and recycling features are two novel techniques we’ve developed to take advantage of temporal redundancy in video.
For learning to skip regions, we developed skip-convolutions for convolutional neural networks (CNNs). We introduce a skip-gate into a convolutional layer of a neural network to skip computations when the differences between the current and previous frame input features are negligible. The skip-gate itself is a tiny neural network that is trainable and computationally efficient. The net result is that the neural network learns to skip unnecessary computations while maintaining accuracy. For example, our skip-convolution technique applied to state-of-the-art object detection models has resulted in a 3x-5x speed-up without sacrificing model accuracy. What’s also noteworthy is that skip-convolutions are broadly applicable and can replace convolutional layers in any CNN for video applications.
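To make the skipping idea concrete, here is a minimal sketch in plain Python, not our actual implementation. It works on a 1-D feature row instead of an image, assumes an odd-length kernel, and replaces the trainable skip-gate with a fixed threshold on the frame-to-frame residual. The names `conv1d`, `skip_conv1d`, and `threshold` are hypothetical and chosen for illustration only. Because convolution is linear, the current output equals the previous output plus the convolution of the residual, so positions whose residual window is negligible can simply reuse the previous output.

```python
def conv1d(x, kernel):
    """Dense 'same' 1-D convolution (cross-correlation) with zero padding."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def skip_conv1d(x_t, x_prev, out_prev, kernel, threshold=1e-3):
    """Update out_prev only where the input changed between frames.

    A fixed threshold stands in for the learned skip-gate (an assumption
    made for this sketch). Returns the new output and the number of
    positions that were actually recomputed.
    """
    pad = len(kernel) // 2
    residual = [a - b for a, b in zip(x_t, x_prev)]
    padded = [0.0] * pad + residual + [0.0] * pad
    out = list(out_prev)
    computed = 0
    for i in range(len(x_t)):
        window = padded[i:i + len(kernel)]
        if max(abs(v) for v in window) <= threshold:
            continue  # gate closed: reuse the previous frame's output
        # Gate open: add the convolution of the residual to the old output.
        out[i] = out_prev[i] + sum(k * v for k, v in zip(kernel, window))
        computed += 1
    return out, computed


kernel = [0.25, 0.5, 0.25]
prev_frame = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
curr_frame = list(prev_frame)
curr_frame[5] = 9.0  # only one input position changes between frames

prev_out = conv1d(prev_frame, kernel)
curr_out, computed = skip_conv1d(curr_frame, prev_frame, prev_out, kernel)
```

In this toy example only the positions whose kernel window touches the changed input are recomputed, and the result matches a dense convolution of the current frame. A larger threshold skips more positions at the cost of a small approximation error, which is the trade-off the learned gate is meant to manage.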
The recycling features technique computes the deep features of the neural network once and reuses them for later frames rather than computing them repeatedly. The intuition behind this is that the deep features remain relatively stationary over time while shallow features contain the temporally varying information. Recycling features is applicable to any video neural network architecture, including those for segmentation, optical flow, classification, and more. On a semantic segmentation example, we saw a 78% reduction in computation and a 65% reduction in latency by using feature recycling. In addition, we saw a dramatic reduction in memory traffic, which significantly saves power.
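The recycling idea can be sketched as a small caching wrapper. This is an illustrative sketch under stated assumptions rather than our production design: the class name `RecyclingPipeline`, the fixed `refresh_every` schedule, and the toy `shallow_fn`, `deep_fn`, and `head_fn` functions are all hypothetical. Shallow features are computed for every frame, while the expensive deep features are computed once and reused until the next refresh.

```python
class RecyclingPipeline:
    """Compute shallow features per frame and recycle cached deep features."""

    def __init__(self, shallow_fn, deep_fn, head_fn, refresh_every=4):
        self.shallow_fn = shallow_fn
        self.deep_fn = deep_fn
        self.head_fn = head_fn
        self.refresh_every = refresh_every  # assumed fixed refresh schedule
        self.deep_calls = 0
        self._cached_deep = None
        self._frame_index = 0

    def process(self, frame):
        # Shallow features carry the temporally varying information.
        shallow = self.shallow_fn(frame)
        # Deep features change slowly, so recompute them only occasionally.
        if self._frame_index % self.refresh_every == 0:
            self._cached_deep = self.deep_fn(shallow)
            self.deep_calls += 1
        self._frame_index += 1
        return self.head_fn(shallow, self._cached_deep)


# Toy stand-ins for real network stages (hypothetical).
def toy_shallow(frame):
    return [2.0 * v for v in frame]


def toy_deep(features):
    return sum(features) / len(features)


def toy_head(shallow, deep):
    return [v + deep for v in shallow]


pipeline = RecyclingPipeline(toy_shallow, toy_deep, toy_head, refresh_every=4)
frames = [[1.0, 3.0], [1.0, 5.0], [2.0, 2.0], [0.0, 4.0],
          [3.0, 3.0], [3.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
outputs = [pipeline.process(f) for f in frames]
```

With eight frames and a refresh every four frames, the deep stage runs twice instead of eight times. A real system would likely decide when to refresh based on the content rather than a fixed counter.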
Making early decisions to reduce computation
Making early decisions attempts to make easy decisions early by dynamically changing the network architecture per input frame. Early decisions, in essence, allow us to skip computation that is unnecessary for maintaining accuracy. Early exiting and frame exiting are two techniques that take advantage of making early decisions.
Early exiting exploits the fact that not all input examples require models of the same complexity to maintain accuracy. For complex input examples, very large models that are usually compute-intensive are needed to classify correctly. However, for simple input examples, very small and compact models can achieve very high accuracies, failing only on complex examples. To take advantage of this, the neural network is composed of a cascade of classifiers placed throughout the network. To make the early exit decision, we gate based on temporal agreement and frame complexity. Early exiting reduces compute while maintaining accuracy. For an object classification example, exiting at the earliest possible neural network layer resulted in a 2.5x reduction in computations while maintaining accuracy.
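A cascade of classifiers with early exits can be sketched as follows. This is a simplified illustration, not the gating we use: it exits on a softmax confidence threshold, whereas the approach described above gates on temporal agreement and frame complexity. The function `classify_with_early_exit`, the `threshold` parameter, and the toy stages are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_early_exit(x, stages, threshold=0.9):
    """Run (block, classifier) stages in order and stop once confident.

    Returns (predicted_label, number_of_stages_executed). The confidence
    threshold is an assumed stand-in for a learned exit gate.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    features = x
    for depth, (block, classifier) in enumerate(stages, start=1):
        features = block(features)
        probs = softmax(classifier(features))
        confidence = max(probs)
        # Exit early when confident, or unconditionally at the last stage.
        if confidence >= threshold or depth == len(stages):
            return probs.index(confidence), depth


# Toy stages (hypothetical): each block doubles the features, which
# sharpens the logits that the per-stage classifier reads off directly.
def double(features):
    return [2.0 * v for v in features]


def identity_classifier(features):
    return list(features)


stages = [(double, identity_classifier)] * 3
easy_label, easy_depth = classify_with_early_exit([4.0, 0.0], stages)
hard_label, hard_depth = classify_with_early_exit([0.0, 0.3], stages)
```

The easy input exits after the first stage, while the ambiguous input runs through all three, so the compute spent scales with the difficulty of each input.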
Frame exiting uses a similar gating concept but attempts to skip computations on entire input frames by making early decisions. For action recognition tasks, frame exiting not only reduces compute but also improves the accuracy of the model. By adding gates to the neural network architecture, deeper layers concentrate on the difficult decisions while earlier layers resolve the easy ones. This gating method also allows us to train models that trade off between accuracy and efficiency, allowing AI developers to customize the model for the use case requirements.
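As a rough sketch of frame exiting for action recognition, the loop below consumes per-frame class scores one at a time, keeps a running average, and stops reading frames once the leading class is far enough ahead. This is an assumption-laden illustration: the function `recognize_with_frame_exit` and the `margin` rule are hypothetical and stand in for a learned gate, and the per-frame scores would come from a real network.

```python
def recognize_with_frame_exit(frame_scores, margin=0.4):
    """Aggregate per-frame class scores and exit once the gate is satisfied.

    frame_scores is any iterable of score lists, one per frame. Returns
    (predicted_label, frames_used). The fixed margin is an assumed
    stand-in for a trained exit gate.
    """
    running = None
    used = 0
    for scores in frame_scores:
        used += 1
        if running is None:
            running = list(scores)
        else:
            # Incremental mean of the scores seen so far.
            running = [r + (s - r) / used for r, s in zip(running, scores)]
        ranked = sorted(running, reverse=True)
        if len(ranked) > 1 and ranked[0] - ranked[1] >= margin:
            break  # confident enough: skip the remaining frames
    if running is None:
        raise ValueError("at least one frame is required")
    return running.index(max(running)), used


easy_clip = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
hard_clip = [[0.5, 0.5], [0.4, 0.6], [0.1, 0.9], [0.0, 1.0], [0.0, 1.0]]

easy_label, easy_used = recognize_with_frame_exit(easy_clip)
hard_label, hard_used = recognize_with_frame_exit(hard_clip)
```

If `frame_scores` is a generator that runs the network lazily, frames after the exit point are never processed at all. Raising or lowering `margin` moves the operating point between accuracy and efficiency.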
Looking beyond efficient video perception
Looking forward, our future research in video perception will advance the existing efficiency techniques discussed above while also developing new conditional compute solutions. We are bringing personalized processing, multi-task learning, sparse convolutions, unsupervised and semi-supervised approaches, quantization-aware training, and platform optimizations into our designs. In addition, our perception research is much broader than video. We are driving high-impact machine learning and computer vision research efforts and inventing technology enablers in several areas of perception, from 3D and RF sensing to personalization and biometrics. We are focused on enabling advanced use cases for important applications, including XR, camera, mobile, autonomous driving, IoT, and much more. I look forward to a future with much more perceptive devices that enhance our everyday lives.