One of the key components of an Augmented Reality system is object detection. Successful object detection returns the identifiers of the objects recognized in a camera frame, as well as the camera’s location and orientation with respect to each identified object. Using this information, the application can render augmentations on top of the camera frame. Upon successful detection, the identity and pose information is passed on to a tracking subsystem, which manages augmentation rendering in subsequent frames.
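To make the role of the pose concrete, here is a minimal sketch of how a point on a detected object could be mapped to a pixel in the camera frame so that an augmentation can be drawn there. It assumes a simple pinhole camera without lens distortion; the function name `project_point` and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are illustrative choices, not part of the system described here.

```python
def project_point(point_obj, rotation, translation, fx, fy, cx, cy):
    """Project a 3D point given in object coordinates into pixel coordinates.

    rotation is a 3x3 matrix (list of rows) and translation a 3-vector; together
    they form the object-to-camera pose. All names here are hypothetical.
    """
    # Transform into the camera frame: p_cam = R * p_obj + t
    x, y, z = (
        sum(rotation[i][j] * point_obj[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0:
        return None  # the point is behind the camera and cannot be drawn
    # Pinhole projection (assumes no lens distortion)
    return (fx * x / z + cx, fy * y / z + cy)


# Identity rotation, object origin 2 units in front of the camera.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]
corner = (0.1, -0.05, 0.0)
pixel = project_point(corner, R, t, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Projecting every vertex of an augmentation this way, frame by frame, is what keeps it visually attached to the object as the pose is updated.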
Our AR team has developed a feature-based object detection solution from the ground up. The system is based on scale-invariant features and histogram-of-gradients descriptors, with significant modifications aimed at reducing complexity. As a result, our solution yields detection performance comparable to current state-of-the-art algorithms, yet at a complexity level that makes it practical to implement on mobile devices.
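For readers unfamiliar with gradient-histogram descriptors, the following is a generic, textbook-style sketch of the basic idea: gradient orientations inside an image patch are accumulated into a small histogram, weighted by gradient magnitude. It is not the modified descriptor described above, whose details are not given here; the function name, the bin count, and the use of central differences are all assumptions made for illustration.

```python
import math


def orientation_histogram(patch, num_bins=8):
    """Magnitude-weighted histogram of gradient orientations over a patch.

    patch is a 2D list of intensities. Border pixels are skipped because
    central differences need a neighbour on each side.
    """
    hist = [0.0] * num_bins
    rows, cols = len(patch), len(patch[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dx = patch[r][c + 1] - patch[r][c - 1]
            dy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0.0:
                continue  # flat region contributes nothing
            angle = math.atan2(dy, dx) % (2 * math.pi)
            bin_index = int(angle / (2 * math.pi) * num_bins) % num_bins
            hist[bin_index] += magnitude
    # L2-normalise so the descriptor is less sensitive to contrast changes
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist


# A horizontal intensity ramp: every gradient points along +x.
patch = [[float(c) for c in range(5)] for _ in range(5)]
descriptor = orientation_histogram(patch)
```

Practical descriptors typically concatenate several such histograms over a grid of sub-patches, which is one place where complexity can be traded against distinctiveness.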
The database used for detecting an object is extracted from "reference views" of the object. In the case of planar objects, a typical choice for a reference view is a frontal, full-resolution image of the object. Additional views can be added to provide greater robustness to viewpoint changes. In this special case, the object geometry is also fairly simple, since all the reference features lie on a single plane.
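Because every feature of a planar object lies on one plane, assigning 3D coordinates to reference features is straightforward. The sketch below assumes a frontal reference image with square pixels, an object frame centred on the image with z = 0 on the object surface, and a known physical width; the function name and parameters are hypothetical.

```python
def reference_pixel_to_plane(u, v, image_width, image_height, physical_width):
    """Map a pixel in the frontal reference image to object coordinates on z = 0."""
    scale = physical_width / image_width  # object units per pixel (square pixels assumed)
    x = (u - image_width / 2.0) * scale
    y = (v - image_height / 2.0) * scale
    return (x, y, 0.0)


# A 640x480 reference image of an object that is 0.2 units wide.
center = reference_pixel_to_plane(320, 240, 640, 480, physical_width=0.2)
corner = reference_pixel_to_plane(0, 0, 640, 480, physical_width=0.2)
```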
In the case of 3D objects, there can be several reference views describing the object, and features from different views need to be described in 3D coordinates that are consistent across all views. Qualcomm’s AR team has developed two alternative pipelines for extracting 3D object representations, also known as "3D object databases", depending on whether or not a 3D model of the object at hand is available. In this situation, the object geometry is verified by estimating a consistent rotation and translation that can explain the observed correspondences.
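One common way to test whether a rotation and translation "explain" a set of correspondences is to reproject the 3D database points under that candidate pose and count how many land close to their matched image features. The sketch below shows only this scoring step, under the assumption of a pinhole camera with a single focal length; how the candidate pose is generated (for example by a RANSAC-style loop) is left out, and all names and the 3-pixel threshold are illustrative.

```python
import math


def count_inliers(points_3d, points_2d, rotation, translation,
                  focal, center, threshold_px=3.0):
    """Count 2D-3D correspondences consistent with a candidate pose."""
    inliers = 0
    for p, (u_obs, v_obs) in zip(points_3d, points_2d):
        # Transform the database point into the camera frame
        x, y, z = (
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        if z <= 0:
            continue  # behind the camera, cannot match an observation
        u = focal * x / z + center[0]
        v = focal * y / z + center[1]
        if math.hypot(u - u_obs, v - v_obs) <= threshold_px:
            inliers += 1
    return inliers


R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]
model_points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.1, 0.1, 0.1)]
# The last observation is a deliberate mismatch.
observations = [(320.0, 240.0), (345.0, 240.0), (320.0, 265.0), (400.0, 300.0)]
support = count_inliers(model_points, observations, R, t,
                        focal=500.0, center=(320.0, 240.0))
```

A detection would then be accepted only if the support exceeds some minimum number of inliers, which rejects sets of matches that no single rigid pose can account for.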
If the 3D model is given, features extracted from multiple object views are projected forward, and their depth is estimated at the intersection with the model, positioned consistently with the camera pose of the view in question. However, if the 3D model is not available, which is most often the case, the problem becomes much more difficult. To resolve this, we have developed a specialized SLAM-based scanning tool that extracts object features from a multitude of views and estimates their 3D coordinates using standard structure-from-motion techniques.
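For the known-model case, the depth of a feature can be found by casting a ray from the camera through the feature and intersecting it with the model surface. The sketch below assumes the model is a triangle mesh already expressed in the camera frame of the view, and uses the well-known Möller-Trumbore ray-triangle test for a single triangle; this is one possible choice, not necessarily the method used in the pipeline described above, and all names are hypothetical.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle_distance(origin, direction, v0, v1, v2, eps=1e-9):
    """Distance along the ray to the triangle, or None if the ray misses it."""
    edge1, edge2 = sub(v1, v0), sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle plane
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det  # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, qvec) * inv_det
    return t if t > eps else None  # ignore hits behind the camera


# Camera at the origin looking along +z; one model triangle at z = 2.
tri = ((-1.0, -1.0, 2.0), (1.0, -1.0, 2.0), (0.0, 1.0, 2.0))
hit = ray_triangle_distance((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), *tri)
miss = ray_triangle_distance((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), *tri)
```

With a unit-length direction, the returned distance multiplied by the direction gives the feature's 3D position in the camera frame. For a full mesh, the nearest hit over all triangles would be kept.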