Amidst fierce competition from 70 international teams from academia and industry, including Google, Microsoft, Tencent and the Korea Advanced Institute of Science and Technology, Qualcomm Research has been a consistent top-3 performer in the 2015 ImageNet challenges for object localization, object detection and scene classification.
ImageNet is the premier benchmark for measuring progress in deep learning for computer vision. It is organized annually by computer scientists at Stanford University, the University of Michigan and the University of North Carolina at Chapel Hill, and it requires the computer systems created by the participating teams to automatically recognize which objects or scenes appear in millions of digital photographs.
Since a deep learning team led by Geoffrey Hinton from the University of Toronto scored a decisive victory in the ImageNet competition in 2012, state-of-the-art object and scene recognition technology has been based on deep convolutional neural networks. These deep nets learn the image representation and the classifier simultaneously by back-propagating information through stacked convolution and pooling layers, an architecture loosely inspired by current hypotheses on how the human brain sees. While the principle of convolutional neural networks has been around for a long time, learning their many parameters has become viable thanks to breakthroughs in parallel computing using graphics processing units and to databases containing a large number of labeled examples, such as the 14 million in the ImageNet collection.
Like others, the methodology used by Qualcomm Research (the R&D division of Qualcomm Technologies, Inc.) starts with very deep convolutional neural networks, parallel computation and large amounts of training data. However, the team innovates in how the deep learning system learns the correct parameters to detect and classify objects in images. Unlike traditional deep learning solutions, which learn the network parameters using the entire image as input, our system is object-centric. This means it derives the parameters of the neural network using only those regions of the image that are relevant for recognition, while irrelevant background pixels are discarded. As a result, the image representations are better suited to recognizing and localizing objects, and they also provide better initializations for networks that specialize in recognizing scenes. Object-centric training forms the foundation for our entries in the object localization, object detection and scene classification challenges.
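To make the object-centric idea concrete, the following is a minimal sketch of how training samples might be cut from an image so that background pixels never reach the network. It is an illustration only, not the team's actual pipeline: it assumes that bounding boxes for the relevant regions are already available, and the function name `object_centric_crops`, the box format and the nested-list image are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), with the max edges exclusive.
Box = Tuple[int, int, int, int]


def object_centric_crops(
    image: Sequence[Sequence[int]],
    annotations: Sequence[Tuple[str, Box]],
) -> List[Tuple[str, List[List[int]]]]:
    """Return one (label, crop) pair per annotated region.

    Pixels outside every box are simply never copied, so a network
    trained on these crops does not see the background.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    crops = []
    for label, (x0, y0, x1, y1) in annotations:
        # Clip the box to the image so slicing stays in bounds.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(width, x1), min(height, y1)
        if x1 <= x0 or y1 <= y0:
            continue  # Box lies entirely outside the image.
        crop = [list(row[x0:x1]) for row in image[y0:y1]]
        crops.append((label, crop))
    return crops


# A 4x4 toy "image" whose pixel value encodes its position.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_crops = object_centric_crops(toy_image, [("bagel", (1, 1, 3, 3))])
```

In a real system each crop would then be resized to the network's input resolution; that step is left out here to keep the sketch dependency-free.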
The goal in the object localization challenge is to recognize which object appears in an image and where it is located. A total of 1,000 object categories are predefined, including bagels, sombreros and traffic lights. For each image, which is known to contain one object, a system may make up to five guesses about which object appears and must predict a tight bounding box for each guess. A prediction is considered correct if the predicted category matches the object of interest and the accompanying box overlaps sufficiently with a manually created ground-truth box. Last year’s winner of the task, the University of Oxford, predicted the object and its location correctly in 74.7% of the 100,000 test images. Qualcomm Research’s top-3 entry improved on that result to 87.4%, while this year’s winner scored 91%.
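The correctness rule described above can be sketched in a few lines. Box overlap is measured here as intersection over union, and the 0.5 threshold is an assumption for illustration, since the text only says the overlap must be sufficient. The names `iou` and `is_correct` are hypothetical.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(
    guesses: List[Tuple[str, Box]],
    true_label: str,
    true_box: Box,
    min_iou: float = 0.5,  # Assumed threshold, not stated in the text.
    max_guesses: int = 5,
) -> bool:
    """True if any of the first five guesses has the right label and box."""
    return any(
        label == true_label and iou(box, true_box) >= min_iou
        for label, box in guesses[:max_guesses]
    )


guesses = [
    ("bagel", (0.0, 0.0, 10.0, 10.0)),
    ("sombrero", (10.0, 10.0, 50.0, 50.0)),
]
hit = is_correct(guesses, "sombrero", (12.0, 12.0, 50.0, 50.0))
```

Here the second guess has the right label and covers almost all of the ground-truth box, so the image counts as correctly localized even though the first guess was wrong.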
Object detection is similar in spirit to the object localization task. Again, the task is to determine which object is located where in the image. However, unlike the localization task, which strives to find a single object, the detection task is to localize every object in the image. The object categories are taken from a list of 200, including jellyfish, microwave and wine bottle. The quality of the detections is evaluated by balancing the number of accurately detected objects against the number of false positives. Last year’s winner of the task, Google, obtained a score of 43.9%. Qualcomm Research’s top-2 entry improved on that result to 53.6%, while this year’s winner, Microsoft Research Asia, scored 62.1%.
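One common way to balance correct detections against false positives is average precision, which rewards a system for ranking its true detections above its false ones. The sketch below computes a simple, non-interpolated average precision for a single category. That the challenge score follows exactly this formulation is an assumption, and the function name and input format are hypothetical.

```python
from typing import List, Tuple


def average_precision(
    detections: List[Tuple[float, bool]],
    num_ground_truth: int,
) -> float:
    """Non-interpolated average precision for one object category.

    Each detection is (confidence, is_true_positive). Objects that are
    never detected lower the score because the sum is divided by the
    total number of ground-truth objects.
    """
    if num_ground_truth <= 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            true_positives += 1
            # Precision measured at the rank of this correct detection.
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth


# Two real objects; the system finds both but also raises one false alarm.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], 2)
```

In this toy case the false alarm is ranked above the second correct detection, which pulls the score below 1 even though both objects were found.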
The goal in the Places2 challenge, a competition organized by researchers from MIT, is to recognize what scene is depicted in an image. A total of 401 scenes are predefined, including freeway, hardware shop, and science museum. Often, the differences between two scenes are subtle, such as between a home bedroom and a hotel bedroom, making the task challenging. Again, each system may predict up to five scene categories. The winning entry for this challenge scored an accuracy of 83.1%, while Qualcomm Research obtained a top-3 position with 82.4%.
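The five-guess accuracy used for scene classification can be illustrated as follows. This sketch assumes that an image counts as correct when its true scene appears anywhere among the system's first five predictions; the function name `top_k_accuracy` and the toy labels are hypothetical.

```python
from typing import List


def top_k_accuracy(
    predictions: List[List[str]],
    labels: List[str],
    k: int = 5,
) -> float:
    """Fraction of images whose true label is among the first k guesses."""
    if not labels:
        return 0.0
    hits = sum(
        1 for ranked, truth in zip(predictions, labels) if truth in ranked[:k]
    )
    return hits / len(labels)


predictions = [
    ["hotel bedroom", "home bedroom", "freeway"],
    ["hardware shop", "science museum"],
]
labels = ["home bedroom", "freeway"]
accuracy = top_k_accuracy(predictions, labels)
```

The first image is counted as correct because the home bedroom appears as the second guess, which shows how allowing several guesses softens the penalty for confusing subtly different scenes.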
The work of Qualcomm Research will be presented by Daniel Fontijne and Koen van de Sande at the ImageNet workshop of the International Conference on Computer Vision on December 17 in Santiago, Chile.