Aug 19, 2019
Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.
As artificial intelligence (AI) emerges as one of the most exciting spaces in technology, there’s a particular area researchers have their eyes on: computer vision. The goal is for machines to “see” and understand visuals much as humans do, which could unlock a potentially limitless range of user experiences. One approach in particular, deep vision, is making waves.
With this form of machine learning, which uses deep neural networks to process information, the team at QUVA Lab — a public-private collaboration between Qualcomm Technologies and the University of Amsterdam — is making advances in the field as well as in AI overall. As demonstrated in the team’s research paper, it accomplished an incredibly difficult machine learning task: pixel-level segmentation of an actor and its action in video from a natural language sentence input. By using sentences as input, this research opens the door to segmenting videos beyond the limited vocabulary traditionally used for video segmentation.
For this project, QUVA Lab researchers took 3,782 YouTube videos that already carried pixel-level annotations and added sentence annotations to them. They then fed the video-sentence pairs into their training model, which learned from this dataset to predict segmentation maps with considerable accuracy. In simpler terms, the model was able to define exactly which pixels in a video comprise the person and action pictured (a woman running, for example), informed by a natural language sentence describing the scene.
Previously, video segmentation was confined to predefined classes, and this project showed a path for moving beyond this traditional, limited method.
To learn more about this exciting project and our innovative research to help improve the way the world computes, connects, and communicates, we chatted with Amir Ghodrati, who completed his tenure as a researcher at QUVA Lab and then joined Qualcomm Technologies Netherlands B.V. as a senior engineer.
What were the results of the research? Did any of the findings surprise you?
We always expect to be surprised, sometimes with amazing findings and other times with disappointing results. We’re thrilled that this time around it was the former.
We were aware of research lines on video segmentation and text-based tracking when we started this project, but we knew no research group had worked on video segmentation with natural language as an input. It’s an extremely difficult task. However, based on our observations and experience, our team judged it might be feasible to tackle. After all our brainstorming, data annotation, and training of different neural network architectures, we built a model that worked better than other methods in the field. That was a sweet moment. And when our paper was accepted as an oral presentation at the Conference on Computer Vision and Pattern Recognition (CVPR), the most prestigious computer vision conference, we really felt like our hard work paid off.
Why was this previously so difficult to achieve?
First, you need access to pixel-level annotations to train a segmentation model, but obtaining these at a large scale is expensive and time consuming. Without enough data, you need to transfer knowledge from other tasks like classification, which is what we did to encode each video. Another factor is that there isn’t a straightforward solution for fusing textual and visual modalities. Simple methods aren’t elegant and, in many cases, the model can’t fully exploit information from both sides. We overcame this by using dynamic convolutional layers to allow for dynamic adaptation of visual filters based on the input sentence. In this way, we were able to jointly learn from both text and video in an end-to-end fashion.
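The fusion idea described here can be sketched in a few lines. The snippet below is a minimal, hypothetical NumPy illustration (all dimensions and weights are made up, and a single 1x1 filter stands in for the full dynamic convolution; the actual model generates its filters from a learned sentence encoder and trains everything end to end):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, for illustration only).
sent_dim = 8      # size of the sentence embedding
vis_ch = 4        # channels in the visual feature map
H, W = 6, 6       # spatial size of the visual feature map

def dynamic_filter(sentence_emb, weight, bias):
    """Generate a 1x1 convolution filter (one weight per visual
    channel) from the sentence embedding: f = tanh(W @ s + b)."""
    return np.tanh(weight @ sentence_emb + bias)   # shape: (vis_ch,)

def segment_response(visual_feat, filt):
    """Apply the sentence-conditioned filter to the visual feature
    map, yielding one response value per pixel."""
    # visual_feat: (vis_ch, H, W); filt: (vis_ch,) -> (H, W)
    return np.tensordot(filt, visual_feat, axes=([0], [0]))

# Hypothetical inputs: a sentence embedding and a video frame's features.
s = rng.standard_normal(sent_dim)
V = rng.standard_normal((vis_ch, H, W))

# The filter-generating weights would normally be learned end to end.
W_f = rng.standard_normal((vis_ch, sent_dim)) * 0.1
b_f = np.zeros(vis_ch)

filt = dynamic_filter(s, W_f, b_f)
response = segment_response(V, filt)
print(response.shape)  # one response per pixel of the feature map
```

The key design point is that the filter applied to the visual features is not fixed: it is computed from the sentence, so a different sentence produces a different response map over the same video frame.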
What are the potential applications of this method you developed?
This method could assist in finding a specific actor and action in real time during daily activities. This would be especially useful for the visually impaired, who might interact with AR glasses through a voice UI. For example, it could help a user find a person or object in a crowded setting, such as finding a friend waving to you in a stadium, by highlighting it on the AR display.
There’s also potential for visual search by a query of existing video files. In the context of a film or news video archive, a visual search system could employ the method for finding and segmenting intended action, such as gathering footage of the moon landing. It could even help you find a video from your family vacation without sorting by date.
This method could also be used for more efficient and accurate video annotation. Currently, this process is very time-consuming and error-prone, and is often done by humans. Allowing the annotator to describe a video with sentences, rather than choosing from a set of predefined classes, would increase accuracy and help solve a major issue in machine learning: having enough data to train with.
Additionally, for drones and autonomous vehicles that rely on segmentation maps to make sense of the world, this method could provide a richer understanding of their surroundings. For example, a drone could use the technology to locate actions described in spoken commands from the command room, such as finding a lost hiker.
What do your findings mean for the advancement of computer vision?
The first and foremost contribution of this project is its introduction of a new task: actor and action video segmentation from a sentence. Second, the end-to-end trainable model we proposed demonstrated the quality of the sentence-guided segmentations, the generalization ability of our model, and its advantage over traditional actor and action segmentation. To make this all happen, we extended two existing datasets with more than 7,500 sentence-level annotations describing the actors and actions appearing in the video content. We made these annotations publicly available for research purposes, which will further work in computer vision.
Where do you think deep vision and AI overall will be in five years? In 10?
I don’t want to give too many predictions for this tricky question, but I expect that we’ll still be working on building machines with high prediction power. During my time in this field, I’ve learned that the progress in computer vision is not always as you predict. Like I said, we expect to be surprised.
I will say that in the next five years, one of the most exciting areas in computer vision will be fine-grained understanding of videos and text with minimal supervision. We already have narrow AI that is limited to a single task: It makes autocomplete suggestions in email, and recommends products, events, books, and more. I think in the next five to 10 years, narrow artificial intelligence will integrate more deeply into our lives and society. However, we’re less likely to have general AI in this time frame.