OnQ Blog

Embodied AI: How do AI-powered robots perceive the world? [+video]

Written by

Taco Cohen

Sep 21, 2023

Engineers in a camera lab outfitted with a robotic arm.

While robots have proliferated in recent years in smart cities, factories and homes, most of the robots we interact with are controlled by classical handcrafted algorithms. These robots have a narrow goal and learn little from their surroundings. In contrast, embodied AI refers to artificial intelligence (AI) agents, such as robots, virtual assistants or other intelligent systems, that can interact with and learn from a physical environment. These agents are equipped with sensors (cameras, pressure sensors, accelerometers, etc.) that capture data from their surroundings, along with AI systems that can analyze and “learn” from the acquired data.

Watch the “Embodied AI” webinar to find out more

AI-powered robots learn through interaction with a physical environment.

Through trial and error, the AI agent develops a “world view”: an abstract representation and understanding of the spatial or temporal dimensions of our world. It learns to reach its goal, whether the goal is to walk, stack boxes or something else entirely.

Embodied AI can transform industries and improve lives. The opportunities are endless.

Think enhancing the manufacturing process, making entertainment and games more interactive and immersive, improving medical triage, surgery and elderly assistance, and making smart warehouses much more efficient and automated. The need for embodied AI is certainly there.

The aging population and labor shortages are already being felt, especially in the developed world.¹ As a result, robot density in the manufacturing industry has increased significantly in recent years. In the U.S., robot density grew to 255 robots per 10,000 employees, a 45% increase from 2015.²

AI-powered robotics has potential to improve manufacturing, digital healthcare, entertainment and warehouses.

AI-powered robotics has immense potential for improving society.

What is needed for embodied AI to proliferate

At Qualcomm AI Research, we are working on applications of generative modelling to embodied AI and robotics, in order to go beyond classical robotics and enable capabilities such as:

Open vocabulary scene understanding.
Natural language interface.
Reasoning and common sense via large language models (LLMs).
Closed-loop control, dynamic actions via LLMs or diffusion models.
Vision-language-action models.

Robotics needs data efficiency, low latency, enhanced privacy and sensor processing. All these requirements can be met through on-device AI, which is why Qualcomm Technologies has been developing platforms, such as the Qualcomm Robotics Platforms, to support the creation of more productive, autonomous and advanced robots. These platforms include the Qualcomm AI Engine, which provides capabilities that can enable innovative applications.

AI processing at the edge meets the needs of embodied AI.

A data-efficient robot motion planning architecture

While AI processing at the edge is a good basis for building embodied AI applications, one critical issue remains to be solved. In contrast to internet AI, which learns from static datasets (e.g., ImageNet, which contains 2D images) to solve various tasks, embodied AI learns by interacting with a physical environment. Such data is not readily available on the internet and is expensive to acquire. The Qualcomm AI Research team has developed a novel data-efficient architecture to improve robots’ perception of their environment. We call this architecture “Geometric Algebra Transformers” (GATr). Sign up for my webinar to learn more.

GATr takes the geometric structure of the physical environment into account through geometric algebra representations and equivariance, while keeping the scalability and expressivity of transformers. Experiments show strong performance, even with little data. At its core, GATr is a general-purpose architecture for geometric data. It has three components: geometric algebra representations, equivariant layers and a transformer architecture.

Geometric algebra representations

GATr uses a mathematical framework called geometric algebra to represent geometric data and perform computations on that data. By embedding different kinds of geometric data into a single geometric algebra, GATr can process a variety of geometric data types, making it suitable for a wide range of applications without requiring modifications to the network architecture.
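As a rough illustration of embedding different geometric objects into one representation, the sketch below stores a scalar, a plane and a point in the same fixed-size array. It assumes a 16-component multivector split by grade (1 scalar, 4 vector, 6 bivector, 4 trivector and 1 pseudoscalar component), as in a projective geometric algebra for 3D space. The slot ordering, the function names and the choice of which object goes into which grade are assumptions made for illustration, not GATr's actual code.

```python
import numpy as np

# Assumed layout of a 16-component multivector, grouped by grade.
GRADE_SLICES = {
    "scalar": slice(0, 1),
    "vector": slice(1, 5),
    "bivector": slice(5, 11),
    "trivector": slice(11, 15),
    "pseudoscalar": slice(15, 16),
}


def embed_scalar(value):
    """Store a plain number in the scalar slot."""
    mv = np.zeros(16)
    mv[GRADE_SLICES["scalar"]] = value
    return mv


def embed_plane(normal, offset):
    """Store a plane (normal and offset) in the four vector slots."""
    mv = np.zeros(16)
    mv[GRADE_SLICES["vector"]] = [offset, *normal]
    return mv


def embed_point(position):
    """Store a 3D point in the trivector slots, with a homogeneous coordinate of 1."""
    mv = np.zeros(16)
    mv[GRADE_SLICES["trivector"]] = [*position, 1.0]
    return mv


# Three different kinds of data become three tokens of identical shape.
tokens = np.stack([
    embed_scalar(0.5),
    embed_plane([0.0, 0.0, 1.0], 2.0),
    embed_point([1.0, 2.0, 3.0]),
])
print(tokens.shape)
```

Because every object ends up with the same shape, the same network layers can consume all of them as tokens without architectural changes.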

Equivariant layers

The benefit of equivariant neural networks is that, no matter how an object is rotated or moved, the model will still identify it. This is key for improving the data efficiency of AI-powered robots.

Graphic explaining that equivariant neural networks will enhance AI-powered robotics.

When we transform network inputs, the outputs transform consistently.
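The property in this caption can be checked numerically. The sketch below is a toy example, not GATr itself: it treats two simple functions on point clouds as stand-in "networks" and tests whether transforming the input and then applying the function gives the same result as applying the function and then transforming the output. All function names here are hypothetical.

```python
import numpy as np


def rotation_z(angle):
    """Rotation matrix for a rotation about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def transform(points, rotation, translation):
    """Rotate and then translate one point (3,) or many points (N, 3)."""
    return points @ rotation.T + translation


def centroid(points):
    """Toy 'network' that is equivariant: the mean of the points."""
    return points.mean(axis=0)


def elementwise_max(points):
    """Toy 'network' that is not equivariant: it depends on the axes chosen."""
    return points.max(axis=0)


def is_equivariant(fn, points, rotation, translation):
    """Check whether fn(transform(x)) equals transform(fn(x))."""
    out_of_transformed = fn(transform(points, rotation, translation))
    transformed_out = transform(fn(points), rotation, translation)
    return bool(np.allclose(out_of_transformed, transformed_out))


points = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
rotation = rotation_z(np.pi / 2)
translation = np.array([0.5, -1.0, 2.0])

print(is_equivariant(centroid, points, rotation, translation))         # True
print(is_equivariant(elementwise_max, points, rotation, translation))  # False
```

An equivariant model does not need to see every rotated or shifted version of a scene during training, which is where the data-efficiency gain comes from.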

Transformer architecture

GATr is based on the transformer architecture, one of the most successful generative AI architectures. The fundamental operation in a transformer is called self-attention, for which we propose an equivariant alternative while preserving the excellent scalability properties of classical self-attention.
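To give a feel for how attention can respect geometry, here is a minimal sketch under strong simplifying assumptions: each token is a plain 3D vector, there are no learned projections, and the attention scores are inner products between tokens. Since inner products do not change under rotation, the attention weights stay the same and the outputs rotate along with the inputs. This is an illustration of the idea only and is not the attention layer used in GATr.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def vector_attention(tokens):
    """Self-attention over 3D vector tokens of shape (N, 3).

    Scores are inner products between tokens, so they are unchanged
    when all tokens are rotated by the same matrix.
    """
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    weights = softmax(scores)
    return weights @ tokens, weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 3))

# Random orthogonal matrix (a rotation, possibly combined with a reflection).
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))

outputs, weights = vector_attention(tokens)
rotated_outputs, rotated_weights = vector_attention(tokens @ rotation.T)

# Attention weights are invariant; outputs transform like the inputs.
print(np.allclose(weights, rotated_weights))
print(np.allclose(outputs @ rotation.T, rotated_outputs))
```

The cost is still that of ordinary dot-product attention, which hints at why an equivariant variant can keep the scalability of the classical operation.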

A line graph showing how GATr performs well compared to other methods.

GATr performs well even with little data.

GATr outperforms other state-of-the-art architectures

Our approach to robot path planning is comparable to generating an image with a diffusion model, except that instead of denoising an image we denoise a robot trajectory. We use GATr as the denoising network, rather than the more standard U-Net.
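The loop below sketches what denoising a trajectory can look like, assuming a standard DDPM-style sampler with a linear noise schedule. The trained denoising network is replaced by a hypothetical "oracle" that already knows the clean trajectory, purely so the example runs on its own. In a real system that role would be played by a learned model such as GATr. The schedule values, step count and function names are assumptions.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed values, commonly used for DDPM)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def oracle_noise_predictor(target):
    """Stand-in for a trained network: it cheats by knowing the clean target."""
    def predict(noisy, alpha_bar):
        return (noisy - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)
    return predict


def sample_trajectory(predict_noise, shape, num_steps=50, seed=0):
    """Start from pure noise and denoise step by step into a trajectory."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    trajectory = rng.normal(size=shape)
    for t in reversed(range(num_steps)):
        eps = predict_noise(trajectory, alpha_bars[t])
        mean = (trajectory - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.normal(size=shape) if t > 0 else 0.0
        trajectory = mean + np.sqrt(betas[t]) * noise
    return trajectory


# Hypothetical clean trajectory: 8 waypoints on a straight line in 2D.
target = np.linspace([0.0, 0.0], [1.0, 2.0], num=8)
result = sample_trajectory(oracle_noise_predictor(target), target.shape)
print(np.allclose(result, target, atol=1e-6))
```

With the oracle in place the sampler recovers the target waypoints. Swapping in a learned noise predictor is what turns this loop into a planner.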

We tested our method on several tasks, including robotic block stacking. In the graph above, our method outperforms all previous methods with 1% of the training data. As we scale the number of items, our method continues to outperform them. GATr scales to tens of thousands of tokens, outperforming the geometric deep learning baselines.

GATr Data Collection

Sep 15, 2023 | 1:20

Towards making embodied AI a reality

We believe that embodied AI will benefit society in manufacturing, healthcare and more. Our model architecture for data-efficient robot motion planning is just one of the embodied AI projects that the Qualcomm AI Research team is working on. I recommend you also check out our work on “Uncertainty-driven Affordance Discovery for Efficient Robotics Manipulation,” which helps AI-powered robots make decisions.

On-device generative AI will continue to play a fundamental role in embodied AI. Furthermore, we believe that equivariance allows for more efficient understanding of 3D images/videos with AI. Stay tuned for more research in this direction.

References

1: United Nations, World Population Prospects, 2017

2: Bloomberg, 2023

Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc. Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

About the Author

Taco Cohen

Engineer, Principal, Qualcomm Technologies Netherlands B.V.