May 13, 2020
Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.
If I asked you to imagine a blue elephant riding a unicycle on the moon, you could probably do so effortlessly. You’d most likely see a rather cartoon-like scenario play out in your mind. Similarly, I could ask you what would happen if you drop a cup of tea while talking with your mother. Once again, you’d likely be able to picture it. This ability to imagine worlds you’ve never seen or will never experience is an extremely powerful feature of human intelligence and one not currently shared with the best AI systems. And it may be the key to understanding intelligence.
While the raw information that hits our senses is a correlated flood of continuous signals such as light and sound waves, these signals are parsed in our brain into discrete objects and their relations. We see tables, chairs, teapots, and dogs. We understand the things they can do, and we understand how they relate to each other. We pour tea from the teapot into the teacup and then expect someone to drink it. Thus, we represent the world with discrete abstractions, including their relations, and subsequently reason about them. It is this modular representation of the world that makes it possible to visualize a blue elephant unicycling on the moon, because we can combine our representations for the modules “blue,” “elephant,” “unicycle,” “cycling,” and “moon” together in a novel way that will never materialize in the real world.
When we reason, we use common sense knowledge about the world. To imagine our blue elephant unicycling, we’re using the laws of physics. And to imagine the falling cup of tea, we use causal relations between objects: the tea will cause the carpet to stain. We even apply the “laws of psychology”: my mother will be angry after seeing the stained carpet. In other words, people have “common sense” about the world that computers mostly lack.
Supervised learning: powerful feature learning yet still fragile
In deep learning, we’ve made significant progress in converting the continuous input signal into more abstract representations (a.k.a. embeddings), with which it is much easier to make predictions. We call this feature learning.
These representations are usually learnt through a process called supervised learning, where we have access to both the input signal (the pixels of an image) and the labels (is there a dog in the image?). By compiling very large datasets of this kind, deep learning systems have surpassed even the best experts on some well-defined tasks. For instance, deep learning systems in medical image analysis now routinely outperform the best physicians at tasks such as predicting melanoma from images of a person’s skin. However, these systems are also fragile: as soon as we present them with data from a slightly different domain (e.g., dark skin versus light skin in melanoma detection), they can fail badly. This fragility can be traced back to a lack of high-level understanding of the problem.
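This fragility is easy to reproduce in a toy setting. The sketch below is entirely my own construction (not the melanoma system): a nearest-centroid classifier trained on one domain does well on data from that same domain, but collapses when the test inputs are shifted slightly.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(shift, n=200):
    """Two Gaussian classes; `shift` moves the whole domain along axis 0."""
    X = np.vstack([rng.normal([0.0 + shift, 0.0], 1.0, (n, 2)),
                   rng.normal([3.0 + shift, 0.0], 1.0, (n, 2))])
    y = np.repeat([0, 1], n)
    return X, y

def predict(X, centroids):
    """Nearest-centroid classifier: assign each point to the closest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None], axis=2)
    return d.argmin(axis=1)

# Train on one domain...
X_tr, y_tr = make_data(shift=0.0)
centroids = np.stack([X_tr[y_tr == c].mean(0) for c in (0, 1)])

# ...then evaluate in-domain and on a slightly shifted domain.
X_in, y_in = make_data(shift=0.0)
X_out, y_out = make_data(shift=3.0)

acc_in = (predict(X_in, centroids) == y_in).mean()
acc_out = (predict(X_out, centroids) == y_out).mean()
print(f"in-domain accuracy: {acc_in:.2f}, shifted-domain accuracy: {acc_out:.2f}")
```

The classifier never learned *why* the classes differ; it memorized where they sit in input space, so a shift it has never seen breaks it.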
Unsupervised learning: learning from raw input signals
But learning does not always have to be supervised. Even without annotations (is there a dog in the image?), we can still define learning tasks. For instance, we can try to predict the future. Since the future always arrives a few moments later, we have created a “self-supervised learning” task. We can also simply try to compress the contents of the world into an efficient “code,” from which we can (approximately) reconstruct the input. This is unsupervised learning, which aims to learn a modular representation of the world in terms of objects and their relations. This learned representation is highly compressed relative to the raw input signal (e.g., sound and light waves): most of the uninteresting signal is discarded, but the part that’s useful for understanding the world is retained.
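As a toy illustration of this compression idea (my own sketch, not a system from this post), we can squeeze correlated two-dimensional data through a one-dimensional code and reconstruct it: the simplest possible encode-then-decode loop, here implemented with a principal-component projection in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D signal: the second coordinate is (almost) twice the first.
t = rng.normal(size=(500, 1))
X = np.hstack([t, 2.0 * t]) + 0.01 * rng.normal(size=(500, 2))

# "Encode": project the centered data onto its top principal direction,
# giving a 1-D code per input.
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
code = (X - mean) @ Vt[0]                 # shape (500,): the compressed code

# "Decode": reconstruct the 2-D input from the 1-D code.
X_hat = np.outer(code, Vt[0]) + mean

err = np.mean((X - X_hat) ** 2)
print(f"reconstruction MSE: {err:.5f}")   # small: the useful structure survives
```

Half the numbers are thrown away, yet almost nothing is lost, because the code captures the structure (the correlation) and discards only the uninteresting noise.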
One way to do unsupervised learning is to build an “encoder” model that maps the raw inputs (e.g., images) into representations. To reconstruct the input data from the code, you also need a decoder, which models the generative process of how the data was produced and can incorporate the laws of physics. For instance, to generate an image of our friend the blue elephant, we need some models that generate images for elephant, unicycle, moon, and blue. And subsequently, we need a model that seamlessly puts these elements together into an elephant on a unicycle on the moon. My claim is that the world is relatively simple to model in this generative direction.
In deep learning, however, we mostly care about the encoder, which models in the opposite direction: it takes the raw signals as input and produces representations of abstract objects and their relations. This is indeed the map we need to classify things and make predictions, but alas the map is complicated, needs a lot of parameters to model, and thus a lot of data to train reliably. And even then, this map is fragile when confronted with data it has never seen, and easily fooled by tinkered inputs designed to confuse the model (a.k.a. adversarial examples). What we need is the generative decoder model to guide the encoder to learn robust models.
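To see how little tinkering it takes to fool a bare encoder, here is a toy sketch of my own: for a plain linear classifier, a small step against the weight vector flips the predicted class while barely changing the input. Real adversarial examples do the analogous thing to deep networks using gradients.

```python
import numpy as np

# A fixed linear "encoder" that classifies by the sign of the score w.x.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, -0.1, 0.4])            # score w.x = 0.6 -> positive class

# Adversarial tinkering: a small step in the direction that lowers the score.
eps = 0.3
x_adv = x - eps * w / np.linalg.norm(w)

print("original prediction:", np.sign(w @ x))
print("adversarial prediction:", np.sign(w @ x_adv))
print("perturbation size:", np.linalg.norm(x_adv - x))
```

The perturbation has norm `eps`, yet the predicted sign flips, because the model has no generative understanding to tell it the two inputs depict the same thing.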
Improving the encoder model
One way to build robust models is to imbue the encoder with structure. For instance, if we turn our heads, then the input hitting our retina drastically changes, yet we do not get confused and understand that we’re looking at the same world but at a different angle. The encoder in our brain is able to transform the representations deep in the cortex, rotating them around as we turn our head. We call this equivariance. It’s been shown that making neural nets also equivariant under rotations helps to make them more robust, efficient, and accurate.
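A minimal sketch of what equivariance means (my own toy example, with an isotropic local-averaging “layer” standing in for a real network): rotating the input and then extracting features gives the same result as extracting features and then rotating them.

```python
import numpy as np

def local_avg(x):
    """An isotropic feature map: average each pixel with its 4 neighbors
    (periodic boundaries). Because the kernel is symmetric, this operation
    commutes with 90-degree rotations."""
    return (np.roll(x, 1, axis=0) + np.roll(x, -1, axis=0) +
            np.roll(x, 1, axis=1) + np.roll(x, -1, axis=1) + x) / 5.0

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))

a = local_avg(np.rot90(img))   # rotate first, then extract features
b = np.rot90(local_avg(img))   # extract features, then rotate them
print("equivariant:", np.allclose(a, b))
```

An equivariant network never has to relearn the rotated world from scratch; the representation simply rotates along with the input.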
But reasoning goes a step further. In contrast to a fast, one-shot inference step, it’s a slow, iterative process in which abstract symbols are manipulated over time. Daniel Kahneman identified this “fast and slow thinking” in human reasoning as well. Imagine looking at a piece of paper that says: (5 x 24) – 47 = ?. The fast neural networks in our visual cortex instantly turn the light coming from the paper and hitting our retina into the discrete symbols “5,” “x,” “24,” etc. But then other parts of the brain, such as the prefrontal cortex, try to solve the equation. This reasoning process needs to perform computations and temporarily store information in memory, which makes it iterative and slow.
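The two stages can be caricatured in a few lines of Python (entirely my own sketch): a “fast” function that maps the raw marks to symbols in one shot, and a “slow” loop that rewrites the expression one small computation at a time, storing intermediate results in a working memory. (`eval` is used only because this is a toy.)

```python
import re

def perceive(marks):
    """Fast, one-shot pass: raw characters -> discrete symbols."""
    return marks.replace("x", "*")

def reason(expr):
    """Slow, iterative pass: rewrite the innermost (...) until a number remains."""
    trace = []
    while "(" in expr:
        inner = re.search(r"\(([^()]+)\)", expr)
        value = eval(inner.group(1))   # one small computation per step (toy only)
        expr = expr[:inner.start()] + str(value) + expr[inner.end():]
        trace.append(expr)             # working memory of intermediate results
    return eval(expr), trace

result, trace = reason(perceive("(5 x 24) - 47"))
print(result)  # 73
```

Note the division of labor: perception is a single cheap pass, while reasoning takes a variable number of steps and needs memory for partial results, just as the prefrontal cortex does.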
AI’s big challenge
The question of how to incorporate reasoning in AI will be the topic of the next decade, and it mirrors an old feud that goes back to the very beginning of AI. Symbolists believe that we need to “hardcode” the rules of logical reasoning into intelligent systems. Connectionists believe that everything can be learned by neural network-like architectures. As usual, the truth is likely somewhere in the middle.
I’m personally a fan of what I call “neural augmentation.” The core idea is to acknowledge that generative models are key to intelligence. The world is just more ordered and easier to model in this forward, causal, physical direction. What we call inference or reasoning is a form of inverse modeling, where we try to iteratively infer abstract, unknown quantities that explain the observations. It’s an encoder, but now iterated over time.
When we perform classical inference (a.k.a. symbolic reasoning), we don’t learn from examples. Instead, we follow the rules of logic and try to find, through optimization, the most likely explanation for what we observe. However, this ignores an important opportunity: to learn from a collection of such problems. Imagine you’re trying to reconstruct a damaged image, and you understand the process that led to the damage. You can optimize some objective based on how the observed image was generated to infer the best possible reconstruction, or you can create a dataset of many pairs of damaged and clean images and instead learn how to reconstruct them. In neural augmentation you do both: you use the classical model as your backbone iterative reasoning model and train a neural architecture to correct its mistakes. Or, viewed from the other side: you embed the generative, causal model into the structure of the neural network.
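Here is a deliberately simplified sketch of that hybrid idea in NumPy. Everything in it is my own stand-in (the damage model `A`, the linear “corrector” in place of a neural network): a known forward model damages a signal, a classical iterative inverse forms the backbone estimate, and a correction learned from (damaged, clean) pairs fixes the backbone’s residual error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
A = np.eye(n) + 0.5 * np.eye(n, k=1)      # known generative "damage" model (a blur)

def classical_inverse(y, steps=50, lr=0.1):
    """Backbone reasoning: iteratively infer the x that explains y under A."""
    x = np.zeros_like(y)
    for _ in range(steps):
        x -= lr * A.T @ (A @ x - y)       # gradient step on ||A x - y||^2 / 2
    return x

# Build a dataset of (damaged, clean) pairs and learn to correct the
# backbone's remaining error with a linear map (the neural-net stand-in).
X_clean = rng.normal(size=(200, n))
Y = X_clean @ A.T                          # damaged observations
X_back = np.stack([classical_inverse(y) for y in Y])
C, *_ = np.linalg.lstsq(X_back, X_clean, rcond=None)   # learned corrector

# On a fresh example, the hybrid estimate beats the classical backbone alone.
x_test = rng.normal(size=n)
y_test = A @ x_test
err_classical = np.mean((classical_inverse(y_test) - x_test) ** 2)
err_hybrid = np.mean((classical_inverse(y_test) @ C - x_test) ** 2)
print("hybrid beats classical:", err_hybrid < err_classical)
```

The classical inverse carries the knowledge of how the damage happened, so the learned component only has to mop up its systematic residual errors, a far easier job than learning the whole inverse map from scratch.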
I believe these types of hybrid architectures are key to designing the models of the future. They incorporate centuries of scientific knowledge in their “guts” and use the power of deep learning to capture the subtle patterns hidden in the data that are too complex to model explicitly. They may also be key to solving the “adversarial examples” problem, where tiny designed changes to the input can completely confuse the system. Humans clearly do not get equally confused by these perturbations, and I hypothesize that this is because our generative understanding of the world filters out nonsensical predictions. And hopefully it may also point the way to incorporating some form of symbolic reasoning into our models.
In my next blog post, I’ll apply this idea of neural augmentation to a specific domain we’re exploring at Qualcomm AI Research. Stay tuned.