OnQ Blog

Computer vision powered by machine learning will change the way we see the world

Humans use their eyes and, yes, their brains to see. It’s the way we’re able to visually sense the world around us. One of the goals of computer vision is for machines to “see” and process images in much the same way, a crucial capability for artificial intelligence. But it’s what computer vision can then do with these images that’s really exciting. Computers can see things that we can’t, sensing infrared, ultraviolet, and other channels of perception beyond our own. Simply put, machines can have greater visual capabilities than we do. Given their superhuman seeing power, the potential impact computer vision will have on our lives is profound. First, though, computers need to be trained: we need to gather sufficient data and develop an understanding of how computers actually see the world.

In this two-part series on AI, we’ll explore the current state of some AI models and then take a look at the work Qualcomm is doing to advance computer vision.

The lighter side of AI

Right now, when most people see computer vision in action, they’re likely seeing it in fun, playful ways. Ever seen a video demo of augmented reality glasses? You know, the ones that focus on wowing you with amazing visuals like projecting sharks swimming through your living room or creatures bursting through the wall. As cool as some of these effects are, there’s loads more computer vision can do.

And then there’s the Not Hotdog app, the cute inside joke of the tech world seen on HBO’s Silicon Valley. In season 4 of the show, which follows a group of guys building a tech startup (if you haven’t watched it, you really should), the characters create an AI app they hope will become the “Shazam of food.” When they test it, the app only identifies hot dogs; everything else is categorized as “not hotdog.” After the episode aired, the creators of the show actually developed the app for laughs.

You might argue that the app fills a great humanitarian need for those who were previously unable to identify what was and was not a hot dog. But even the least tech-savvy characters of the show were less than impressed with the app. The demonstration scene shows us that building a useful classifier is anything but simple. It also illustrates how people’s expectations of such classifiers can clash with the reality of what they can actually achieve.

It’s not just in the show that such complexity exists. In a blog post on Medium, the real-life creators of the Not Hotdog app stress that while the app was built over a weekend on one laptop with a single GPU, considerable time and effort went into making the final product user-friendly. They spent weeks improving the overall accuracy of the app, as well as a whole weekend optimizing the user experience around iOS and Android permissions.

Object recognition through machine learning, which essentially means training a model to identify and classify objects in images, is not so simple to implement. It requires thousands of images as training data, as well as lots of time, effort, and patience from developers. The Not Hotdog app shows us that while computer vision technology has heaps of potential, sufficient training data is crucial to making it work.
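
To make that concrete, here is a minimal sketch of the kind of binary classifier the Not Hotdog app represents, written with PyTorch and torchvision. This is not the app’s actual code; the folder layout, model choice, and hyperparameters below are illustrative assumptions. The idea is transfer learning: start from a network pretrained on ImageNet and retrain only its final layer on your two classes.

import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Standard preprocessing for an ImageNet-pretrained backbone.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical layout: data/train/hotdog and data/train/not_hotdog.
# ImageFolder infers the two class labels from the directory names.
train_set = datasets.ImageFolder("data/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Freeze the pretrained features; retrain only the classification head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)  # hotdog vs. not hotdog

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

Even this toy version assumes thousands of labeled images sitting in those folders, which is exactly the hard part described above.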

The accuracy of AI

Much like a young child still learning the difference between a banana and a block, even a classifier that’s been trained with extensive data can make mistakes. Consider the phenomenon that gripped social media back in 2016: the animals vs. food tweets created by meme machine Karen Zack (@teenybiscuit on Twitter). She posted a series of images that hilariously asked “chihuahua or muffin?”, “puppy or bagel?”, “parrot or guacamole?” — you get the point.

The ability to make these distinctions is the hallmark of any image classifier, of course. To that end, we wanted to see how image recognition from Clarifai, a company specializing in AI, would do when put to the test. It achieved an impressive accuracy rate, correctly distinguishing chihuahuas from muffins 95.8 percent of the time. However, in other tests where the software was asked to identify what was in a picture, it was way off the mark. In one case, the model failed to detect the object — a duck — and classified an area of water it was swimming in as a car.
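
For readers curious what such a test looks like in practice, here is a rough sketch of single-image classification using a stock pretrained model from torchvision. To be clear, this is not Clarifai’s API, and “chihuahua.jpg” is a placeholder filename; it simply shows how a classifier turns a photo into a ranked list of labels with confidence scores.

import torch
from PIL import Image
from torchvision import models

# Load an off-the-shelf classifier pretrained on ImageNet.
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.eval()

# The weights object bundles the matching preprocessing and class names.
preprocess = weights.transforms()
image = preprocess(Image.open("chihuahua.jpg")).unsqueeze(0)

with torch.no_grad():
    probs = model(image).softmax(dim=1)[0]

# Print the model's five best guesses and its confidence in each.
top = probs.topk(5)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{weights.meta['categories'][idx]}: {score:.1%}")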

Similarly, when Microsoft’s CaptionBot AI was shown a picture of a rather terrifying bug, it identified it as a dog. The classifier was uncertain, stating that it “can’t be confident.”
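
That “can’t be confident” response hints at a useful pattern: rather than always returning its best guess, a classifier can abstain whenever no label clears a probability threshold. Here is a minimal sketch of the idea, with an illustrative threshold and made-up labels and scores.

import torch

def classify_or_abstain(logits, labels, threshold=0.6):
    """Return the top label, or abstain when confidence is too low."""
    probs = torch.softmax(logits, dim=-1)
    score, idx = probs.max(dim=-1)
    score, idx = score.item(), idx.item()
    if score < threshold:
        return f"can't be confident (best guess: {labels[idx]}, {score:.0%})"
    return f"{labels[idx]} ({score:.0%})"

# Hypothetical raw scores from some model for three candidate labels.
logits = torch.tensor([1.1, 0.9, 0.7])
print(classify_or_abstain(logits, ["dog", "bug", "crab"]))
# -> can't be confident (best guess: dog, 40%)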

So where did these classifiers go wrong? Why is one classifier able to confidently distinguish chihuahuas from muffins, while others are unable to accurately classify a duck or a bug? On one level, we could say it comes down to data: the more training data a classifier is given, the more accurately it will identify objects, while insufficient training data hampers its ability to detect them. So if we fed these classifiers more pictures of ducks and bugs, it follows that they should get better at correctly identifying them.
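
One common stopgap when labeled data is scarce, worth mentioning alongside the point about feeding classifiers more pictures, is data augmentation: synthesizing extra training views from the images you already have. Below is a quick torchvision sketch; the specific transforms and parameters are illustrative choices, not a recipe.

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),      # vary framing and scale
    transforms.RandomHorizontalFlip(),      # mirror half the images
    transforms.ColorJitter(brightness=0.2,  # vary lighting conditions
                           contrast=0.2),
    transforms.ToTensor(),
])

Augmentation stretches a small dataset further, but it is no substitute for genuinely diverse examples of ducks, bugs, and everything else a model needs to tell apart.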

We do need to acknowledge one obvious but significant truth: computer vision and human vision are nothing alike. Sometimes, we simply don’t know why a machine sees something totally different in an image. Therein lies the challenge.

We at Qualcomm have been working to improve the image-classifying capabilities of our AI for a while now. Back in 2015, we were a top performer in an ImageNet challenge, which tested image recognition capabilities, specifically object localization, object detection, and scene classification.

We continue to research the latest applications of computer vision in a variety of exciting fields, including automotive, virtual reality, augmented reality, and IoT. Recently, we rolled out our Drive Data Platform, which showcases the growing role of edge analytics in pushing the envelope of autonomous driving.

Don’t miss Part 2, in which we discuss how computers see the world and explore some of the more serious applications computer vision could have in the future.

 

This article was written in collaboration with our former intern, Megan Smith.