OnQ Blog

Exploring the next frontier of AI: Multimodal systems and real-time interaction

Discover the state of the art in large multimodal models with Qualcomm AI Research

In the realm of artificial intelligence (AI), the integration of senses — seeing, hearing and interacting — represents a frontier that is rapidly expanding thanks to the proliferation of large multimodal models (LMMs). 

The expanding capability of AI systems to process and understand multimodal data offers groundbreaking possibilities for applications in the Internet of Things (IoT), mobile, robotics and more. It also opens the door to fundamental changes in how we engage with technology, allowing for the kind of nuance and real-time flow you would expect in human interactions.

One of the significant advances in this field is the ability to run AI systems directly on devices. Running on-device not only reduces memory costs and latency, but also enhances reliability and security. Qualcomm AI Research has been at the forefront of this work, showing earlier this year the world's first multimodal large language model (LLM) running on an Android phone. This breakthrough demonstrates the potential for more sophisticated AI applications that are accessible on everyday devices.

Real-time solution

Real-time model streaming, the ability to collect and ingest data from various sources and process it as it arrives, is a key requirement for an AI model to understand its environment and interact with nearby humans.

An auto-regressive language model is a useful component of such a multimodal agent because it can already carry on a dialogue with a user. Additionally, language makes it easy to encode surrogate tasks from which a degree of “common sense” can emerge.

As a result, a streaming multimodal LLM is seen as a way to overcome two limitations of current models that stand in the way of open-ended, asynchronous interactions with situated agents:

  • They are limited to turn-based interactions about offline documents or images.
  • They are limited to capturing momentary snapshots of reality in a visual question answering (VQA) type of dialogue.
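To make the contrast with turn-based dialogue concrete, here is a minimal sketch of an asynchronous streaming loop in which the agent sees one observation per time step and decides for itself whether to speak. Everything in it is an assumption made for illustration: the names `run_stream` and `speak_on_change` are hypothetical, the observations are plain strings rather than video frames, and a toy rule stands in for the streaming multimodal model.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical stand-in for a streaming multimodal model: it receives the
# current and previous observation and returns text, or None to stay silent.
Policy = Callable[[str, Optional[str]], Optional[str]]


def speak_on_change(observation: str, previous: Optional[str]) -> Optional[str]:
    """Toy rule (an assumption): comment only when the observed activity changes."""
    if observation != previous:
        return f"I see you are now {observation}."
    return None


def run_stream(observations: Iterable[str], policy: Policy) -> Iterator[Tuple[int, str]]:
    """Feed observations one at a time and yield (step, utterance) when the agent speaks."""
    previous: Optional[str] = None
    for step, observation in enumerate(observations):
        utterance = policy(observation, previous)
        if utterance is not None:
            yield step, utterance
        previous = observation


# The user never asks a question; the agent speaks when the stream warrants it.
stream = ["standing", "standing", "squatting", "squatting", "standing"]
events = list(run_stream(stream, speak_on_change))
for step, utterance in events:
    print(step, utterance)
```

Unlike a VQA exchange, nothing here waits for a user turn: the output is driven by what happens in the stream, which is the behavior the two limitations above rule out.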

A few years ago, we expanded our generative AI efforts to study multimodal language models. The main motivation was to extend the use of language from mere communication to reasoning, and to enhance a model’s common-sense understanding of everyday situations.

We have made progress with end-to-end training of situated vision language models, where the model can process a live video stream in real time and dynamically interact with users. While this technology may provide immediate value in the form of coaching and live assistance, we believe that the long-term path towards the humanoid robots that are currently so widely extolled will also require enabling AI models to reason and interact with us using common sense.

Figure 1. Sample from the labeled fitness dataset. This allows the AI agent to better recognize human gestures.

Towards an integrated interactive solution 

Qualcomm Technologies, Inc. continues to support the machine learning community by providing tools that enable the development of applications based on multimodal interactions. These tools foster innovation and allow developers to explore new frontiers in AI applications, from AI models with enhanced common-sense reasoning to intelligent personal assistants.

Over the past few years, the Qualcomm Innovation Center (QuIC) has open-sourced these tools on GitHub.

Our research, “Is end-to-end learning enough for fitness activity recognition?”, has shown that efficient visual streaming at inference time can be enabled using steppable, causal 3D convolutions. Thanks to the Sense inference engine, developers can now enhance their applications with the ability to see and interact with humans using any RGB camera.
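The idea behind a steppable, causal convolution can be illustrated with a deliberately reduced sketch. It is not the Sense engine's code or API: it assumes a single scalar feature per frame and convolves along the time axis only, whereas the networks described above use 3D convolutions over space and time. The names `causal_conv_offline` and `SteppableCausalConv` are hypothetical. The point is that a causal filter, whose output at time t depends only on frames up to t, can be evaluated one frame at a time from a small buffer and still match the result of processing the whole clip.

```python
from collections import deque
from typing import Deque, List, Sequence


def causal_conv_offline(frames: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Causal temporal convolution over a whole clip.

    Output t depends only on frames t-k+1 .. t; missing past frames count as zeros.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(w * x for w, x in zip(kernel, padded[t:t + k]))
        for t in range(len(frames))
    ]


class SteppableCausalConv:
    """The same filter, evaluated one frame at a time with a small state buffer."""

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        k = len(self.kernel)
        # Holds the last k frames, zero-initialised to match the offline padding.
        self.buffer: Deque[float] = deque([0.0] * k, maxlen=k)

    def step(self, frame: float) -> float:
        self.buffer.append(frame)  # the oldest frame drops out automatically
        return sum(w * x for w, x in zip(self.kernel, self.buffer))


frames = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.25]  # kernel[0] weights the oldest frame in the window

offline = causal_conv_offline(frames, kernel)
conv = SteppableCausalConv(kernel)
streamed = [conv.step(f) for f in frames]
print(offline == streamed)
```

Each call to `step` costs one kernel-sized dot product and needs no future frames, which is what makes this style of layer suitable for inference on a live camera stream.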

End-to-end training, in conjunction with the distillation of multimodal reasoning capabilities (“Look, remember and reason: Grounded reasoning in videos with language models”), allows a model to provide useful and accurate feedback in real time. We have incorporated this approach in the fitness demo that we showcase in the “Integrating senses in AI” webinar.

Figure 2. Sample from the labeled fitness dataset. This enables the AI agent to identify fitness techniques and correct them when necessary.

Multimodal learning will continue to advance

The future of AI lies in its ability to function in real time and with situational awareness of its surroundings, enabling machines to interact with the world in ways that are as complex and nuanced as human interaction. Qualcomm Technologies is at the forefront of this technology, pushing the boundaries of what AI can achieve with multimodal learning and on-device processing. As AI continues to evolve, it can transform our interactions with technology, making them more intuitive, efficient and secure.

 


Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc. and/or its subsidiaries.

About the Author
Roland Memisevic, Sr. Director, Engineering, Qualcomm Canada ULC

