Numerous industries are developing deep learning-based approaches to solve problems in machine perception spanning imaging, video, audio, and language understanding. These applications range from smartphones and automotive to a variety of IoT devices (robotics, home automation, etc.). Deep neural networks (DNNs) are typically both memory- and compute-intensive. This puts them at direct odds with the power and performance constraints of embedded devices, necessitating algorithmic and software optimizations to make DNNs suitable for deployment at the edge. They must perform within a power-constrained, compute-constrained environment while maintaining the performance the networks would normally achieve in the cloud or even on a desktop PC.
Key Research Areas:
We are applying our deep learning expertise across all aspects of DNNs including algorithms, software frameworks, and hardware architecture to enable cutting-edge DNNs for deployment in automotive, robotics, IP cameras, audio, IoT, and other domains at the edge. By optimizing our Snapdragon™ System on Chip (SoC) platforms, we enable highly efficient execution of these networks.
Our algorithmic research focuses on two aspects. First, we have a deep understanding of leading DNN architectures for solving leading-edge problems in computer vision, audio, and speech. Our aim is to understand how best to map those to our Snapdragon SoC within our customers' computational requirements while maintaining high performance. The second aspect involves understanding approaches that apply across DNN types, focusing on optimization for embedded applications, including bit-width reduction, compression, and sparsity techniques.
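To illustrate one of the embedded optimizations mentioned above, here is a minimal sketch of bit-width reduction: storing 32-bit floating-point weights as 8-bit integers with a per-tensor scale. This is a generic symmetric linear quantization scheme for illustration only, not the specific technique used on Snapdragon; the function names are hypothetical.

```python
import numpy as np

def quantize_weights(weights, num_bits=8):
    """Symmetric linear quantization of a weight tensor to num_bits integers."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    scale = np.abs(weights).max() / qmax      # one scale factor per tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the quantized representation."""
    return q.astype(np.float32) * scale

# Example: reduce 32-bit float weights to 8-bit storage (4x smaller),
# at the cost of a bounded rounding error per weight.
w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_weights(w)
w_hat = dequantize(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

The rounding error is bounded by half the scale factor, which is why such networks can often keep their accuracy after quantization (sometimes with fine-tuning).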
Two leading examples of our algorithmic research are innovations in highly sophisticated DNN models such as the Semantic Segmentation network and the Single Shot Multibox Detector (SSD) network. We evolved both networks with our own optimization techniques to better fit them within the power and performance constraints of embedded devices such as smartphones, drones, virtual reality headsets, and automobiles. We carefully analyze existing bottlenecks in these networks and employ optimization techniques, along with structural changes to the networks themselves, to significantly improve their efficiency while maintaining their accuracy.
In 2014, we were the first to demonstrate that state-of-the-art neural networks could be run on existing mobile devices with respectable performance. We efficiently manage the available resources on our SoC, maximizing the computational performance of our three compute cores – Central Processing Unit (CPU), Graphics Processing Unit (GPU), and Digital Signal Processor (DSP) – via heterogeneous computing techniques. For example, we can dispatch one DNN to run on the DSP and another on the GPU simultaneously, or even switch the DNNs between the DSP and GPU on the fly. Approaches for even more robust and granular computation of DNNs across the various Snapdragon SoC compute cores are a key area of ongoing research.
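The dispatch pattern described above can be sketched in a few lines. This is a toy simulation using Python threads, not the actual Snapdragon runtime API: the core names and the `run_network` stand-in are hypothetical, and a real runtime would expose its own interface for selecting a compute core.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical core identifiers; a real heterogeneous runtime would expose
# its own handles for the CPU, GPU, and DSP.
CORES = ("CPU", "GPU", "DSP")

def run_network(name, core):
    """Stand-in for executing one DNN inference pass on the given core."""
    assert core in CORES
    return f"{name} ran on {core}"

# Dispatch two networks to different compute cores at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    seg = pool.submit(run_network, "segmentation", "GPU")
    det = pool.submit(run_network, "detector", "DSP")
    results = [seg.result(), det.result()]
print(results)
```

The design point is that the two inference passes overlap in time instead of queuing on a single core, which is what makes heterogeneous scheduling attractive on a power-constrained SoC.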
There are many different neural network training frameworks that are popular with the artificial intelligence community, and developers can continue to use the framework of their choosing. We are building converter tools for the most popular software frameworks, including Caffe, TensorFlow, and Torch, so that DNNs created in these frameworks can easily run on Snapdragon. This will enable customers and researchers to transition their networks onto our SoC with minimal development turnaround time.
The goal of our Semantic Segmentation network is to build a pixel level map by labelling every pixel in an image in order to categorize objects and generate an understanding of what’s happening in a scene. The approach involves taking in an image, encoding it into a feature space, then decoding it into a pixel map such that every pixel in the output is labeled for the class which it represents in the original image. The original image is essentially converted into a colored overlay such that the network produces an annotated map of colored pixels that is generated at the same resolution as the input image.
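The final decoding step described above – turning per-pixel class labels into a colored overlay at the input resolution – can be sketched directly. The palette below is a hypothetical 4-class example for illustration; real segmentation networks use many more classes and their own color conventions.

```python
import numpy as np

# Hypothetical palette mapping class index -> RGB color.
PALETTE = np.array([
    [0, 0, 0],        # 0: background
    [128, 64, 128],   # 1: road
    [220, 20, 60],    # 2: person
    [0, 0, 142],      # 3: car
], dtype=np.uint8)

def labels_to_overlay(label_map):
    """Turn an HxW map of per-pixel class indices into an HxWx3 color image
    at the same spatial resolution, as in the decoder output described above."""
    return PALETTE[label_map]

labels = np.array([[0, 1], [2, 3]])   # tiny 2x2 "image" of class indices
overlay = labels_to_overlay(labels)
print(overlay.shape)                  # same resolution as the input, plus RGB
```

Because the lookup is a simple array index, the overlay is generated at exactly the input resolution, one color per labeled pixel.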
We engineered our SSD network to perform simultaneous object localization and classification, where tight bounding boxes are placed around specific groups of pixels that signify objects of interest. SSD has a similar architecture to Semantic Segmentation, as both networks can utilize the same encoding process. However, SSD's decoding stage is completely different. It performs feature extraction not to produce a map of pixels, but rather to identify objects within a scene that the network has been trained to select. To do this, our network uses feature-extraction maps at different resolutions for identifying objects of various sizes; for example, for small, high-resolution objects, it would use highly encoded feature-extraction maps. As our network maps the scene, it creates approximately 2,000 box proposals to identify the objects of interest, makes a confidence determination about which boxes are legitimate objects, selects them, and places bounding boxes around the most promising objects.
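The proposal-selection step described above can be sketched as confidence filtering followed by greedy non-maximum suppression, a standard way to collapse thousands of overlapping box proposals into final detections. This is a generic illustration, not Qualcomm's specific implementation; the thresholds and helper names are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def select_boxes(proposals, score_thresh=0.5, iou_thresh=0.5):
    """Keep confident proposals, then suppress overlapping duplicates
    (greedy non-maximum suppression over boxes sorted by confidence)."""
    kept = []
    for box, score in sorted(proposals, key=lambda p: -p[1]):
        if score < score_thresh:
            break                      # remaining proposals are even less confident
        if all(iou(box, k) < iou_thresh for k, _ in kept):
            kept.append((box, score))
    return kept

# Two overlapping proposals for one object plus a low-confidence stray box.
proposals = [
    ((10, 10, 50, 50), 0.9),
    ((12, 12, 52, 52), 0.8),      # overlaps the first -> suppressed
    ((200, 200, 240, 240), 0.3),  # below the confidence threshold -> dropped
]
print(select_boxes(proposals))    # only the highest-confidence box survives
```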
In October 2016, we successfully demonstrated our adaptation of the SSD network at ECCV 2016 (European Conference on Computer Vision). Trained on 16 classes of objects, our SSD ran on a Qualcomm Snapdragon 820-powered tablet in live view mode, where it analyzed street scenes in Amsterdam, classifying and detecting cars, bicycles, people, and other objects in a visual scene. This example illustrated the power of the Snapdragon Neural Processing Engine and its ability to process a very large network (5 billion+ operations per pass of the network) at 5-6 FPS (roughly 25 billion+ operations per second).
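The throughput figure follows from simple arithmetic on the two numbers quoted above:

```python
ops_per_pass = 5e9   # 5 billion+ operations per forward pass (from the demo)
fps = 5              # lower bound of the 5-6 FPS achieved on the tablet
throughput = ops_per_pass * fps
print(throughput / 1e9, "billion ops/second")  # -> 25.0
```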
To continually make advancements in cutting-edge DNNs, we leverage two significant university relations programs. In September 2016, we opened a joint research lab with the University of Amsterdam, focused on advancing next generation machine learning techniques for mobile computer vision. Combining on-device deep learning with computer vision will enhance intelligence in smartphone cameras, robotics, automotive and IoT applications. The goal for the deep vision program is for neural networks to automatically understand the meaning of what’s happening in images or videos (from a who, what, when, where, and why perspective) in order to make better decisions. For example, in an autonomous driving use case, if the cameras in the automobile detect a red traffic light, the network would know to cue the vehicle to stop.
We are also partnered with the University of California, Berkeley on their DeepDrive program, which is pioneering advancements in computer vision and deep learning for automotive applications, also known as deep automotive perception. By combining deep learning, computer vision, and vehicles, the DeepDrive program pursues fundamental research, embedded/hardened implementation, and real-world demonstrations around key research themes such as pedestrian detection and scene classification, areas poised for dramatic progress in computer vision.
If you find the work we're doing in DNNs to be exciting, and you have a technical background in deep learning, DNN modeling, or DNN software frameworks, we'd love to hear from you. Please visit us at www.qualcomm.com/company/careers to submit your resume.
[Image: Single Shot Multibox Detector (SSD) network]
May 14, 2018