Fast and accurate language-based 3D scene understanding

Efficient and accurate 3D scene understanding remains a long-standing challenge in computer vision, especially for applications spanning XR, robotics, and spatial AI. Recent language-based perception models have shown that a single structured language architecture can support multiple 3D tasks, such as layout estimation and 3D object detection.

However, most existing language-based approaches rely on step-by-step token generation, which results in high inference latency and limits their practicality for real-time or large-scale use.

In this article, we present Fast SceneScript, a research contribution accepted to CVPR 2026, which explores how multi-token prediction can significantly accelerate inference while preserving accuracy for language-based 3D scene understanding.

This post summarizes the key research ideas, motivations, and findings behind Fast SceneScript.

Why language-based 3D scene understanding is slow

Language-based scene understanding represents 3D geometry and semantics as structured sequences of tokens following a predefined schema. This formulation is flexible: the same model can be applied to different 3D perception tasks simply by changing the output token structure.
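
To make the idea of a schema concrete, here is a minimal sketch of how scene entities might be flattened into tokens. The entity types (Wall, Box), their fields, the quantize binning range, and the STOP token are all hypothetical choices made for illustration; they are not the schema used by SceneScript or Fast SceneScript.

```python
from dataclasses import dataclass


@dataclass
class Wall:
    # Hypothetical layout primitive: a wall segment with two endpoints and a height.
    x1: float
    y1: float
    x2: float
    y2: float
    height: float


@dataclass
class Box:
    # Hypothetical detection primitive: a labeled 3D box (center and size).
    label: str
    cx: float
    cy: float
    cz: float
    w: float
    h: float
    d: float


def quantize(value, lo=-10.0, hi=10.0, bins=256):
    """Map a continuous value to an integer bin so it can be emitted as a token."""
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * (bins - 1))


def to_tokens(entities):
    """Serialize entities as [TYPE, param, param, ...] runs followed by a stop token."""
    tokens = []
    for e in entities:
        if isinstance(e, Wall):
            params = (e.x1, e.y1, e.x2, e.y2, e.height)
            tokens += ["WALL"] + [quantize(v) for v in params]
        elif isinstance(e, Box):
            params = (e.cx, e.cy, e.cz, e.w, e.h, e.d)
            tokens += ["BOX", e.label] + [quantize(v) for v in params]
        else:
            raise TypeError(f"unsupported entity: {e!r}")
    return tokens + ["STOP"]


# The same serializer covers both tasks; only the emitted token structure differs.
layout = to_tokens([Wall(0.0, 0.0, 4.0, 0.0, 2.5)])
detection = to_tokens([Box("chair", 1.0, 2.0, 0.5, 0.6, 0.9, 0.6)])
print(layout)
print(detection)
```

Under this assumed schema, switching from layout estimation to object detection changes only which entity types appear in the sequence, which is the flexibility the paragraph above describes.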

The main drawback is efficiency. Existing methods typically generate one token per decoder forward pass, so the number of passes, and with it the inference cost, grows linearly with sequence length. As scene complexity increases, this token-by-token decoding quickly becomes a bottleneck.
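
The linear scaling is easy to see in a toy decoding loop. The sketch below uses a hypothetical ToyDecoder that simply replays a fixed target sequence and counts forward passes; it stands in for a real transformer decoder and is not the SceneScript implementation.

```python
class ToyDecoder:
    """Hypothetical stand-in for a decoder: replays a fixed target, counts passes."""

    def __init__(self, target):
        self.target = target
        self.passes = 0

    def next_token(self, prefix):
        # One call corresponds to one full forward pass of the decoder.
        self.passes += 1
        return self.target[len(prefix)]


def decode_next_token(model, stop="STOP"):
    """Standard next-token decoding: one forward pass per generated token."""
    out = []
    while not out or out[-1] != stop:
        out.append(model.next_token(out))
    return out


def count_passes(target):
    """Decode the target sequence and report how many forward passes it took."""
    model = ToyDecoder(target)
    assert decode_next_token(model) == target
    return model.passes


# More entities in the scene means a longer sequence and proportionally more passes.
for n_entities in (1, 2, 4):
    target = ["WALL", 1, 2, 3, 4, 5] * n_entities + ["STOP"]
    print(len(target), "tokens ->", count_passes(target), "forward passes")
```

The number of passes always equals the number of tokens, so doubling the scene description roughly doubles the decoding cost.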

Faster decoding with multi-token prediction

Fast SceneScript addresses this efficiency issue by predicting multiple structured tokens per decoding step, which reduces the number of forward passes required during inference. In panels (b) and (c) of the comparison diagram, the gray vertical lines mark individual decoder passes. While SceneScript requires 21 decoding iterations to generate the full sequence, Fast SceneScript produces the same output in just 3 iterations, a 7x reduction in decoder passes for this example.

Comparison diagram showing that Fast SceneScript generates the same structured 3D scene output in 3 decoding iterations versus 21 iterations for SceneScript.
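
The pass counts in the diagram can be reproduced with a toy example. The sketch below assumes a hypothetical decoder with num_heads prediction heads whose outputs are all correct and all accepted. The choice of 7 heads is an assumption, picked so that a 21-token sequence finishes in 3 passes; the post does not state how many tokens Fast SceneScript predicts per step.

```python
class ToyMultiTokenDecoder:
    """Hypothetical decoder with several heads; each pass yields up to num_heads tokens."""

    def __init__(self, target, num_heads):
        self.target = target
        self.num_heads = num_heads
        self.passes = 0

    def next_tokens(self, prefix):
        # One forward pass predicts the next num_heads tokens at once.
        self.passes += 1
        start = len(prefix)
        return self.target[start:start + self.num_heads]


def decode_multi_token(model, stop="STOP"):
    """Append every predicted token from each pass until the stop token appears."""
    out = []
    while not out or out[-1] != stop:
        out.extend(model.next_tokens(out))
    return out


def passes_needed(target, num_heads):
    """Decode the target and report the number of forward passes used."""
    model = ToyMultiTokenDecoder(target, num_heads)
    assert decode_multi_token(model) == target
    return model.passes


target = [f"t{i}" for i in range(20)] + ["STOP"]  # 21 tokens in total
print("next-token prediction:", passes_needed(target, 1), "passes")   # 21 passes
print("multi-token prediction:", passes_needed(target, 7), "passes")  # 3 passes
```

This idealized version accepts every predicted token. In practice later heads can be wrong, which is why the decoding strategies in the next section matter.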

Reliable multi-token decoding

While multi-token prediction can dramatically improve speed, it introduces the risk of inaccurate predictions when future tokens are generated in parallel. The core insight of Fast SceneScript is that multi-token prediction can be made reliable for structured 3D scenes by carefully selecting and accepting only trustworthy predictions during decoding. This design enables substantial speedups while maintaining the accuracy of traditional autoregressive models. The paper explores two alternative strategies:

  • Self-speculative decoding (SSD), which verifies predicted tokens and retains only those that remain consistent
  • Confidence-guided decoding (CGD), which estimates token reliability directly and stops decoding when predictions become uncertain

Both strategies are adapted to structured scene tokens, allowing Fast SceneScript to exploit multi-token prediction safely, without degrading accuracy.
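
The difference between the two strategies and naive multi-token prediction can be sketched with a toy model. Everything here is an illustrative assumption rather than the paper's implementation: ToyModel, its hard_positions (where the later heads guess wrong), the fixed confidence values, and the 0.9 threshold are all hypothetical. The verification pass is counted as a separate forward pass for simplicity, although an implementation may be able to fuse it with the next drafting pass.

```python
class ToyModel:
    """Hypothetical model: head 0 is always right, later heads err at hard positions."""

    def __init__(self, target, num_heads, hard_positions):
        self.target = target
        self.num_heads = num_heads
        self.hard = set(hard_positions)
        self.passes = 0

    def draft(self, prefix):
        """One forward pass: (token, confidence) for up to num_heads next positions."""
        self.passes += 1
        start = len(prefix)
        end = min(start + self.num_heads, len(self.target))
        drafted = []
        for head, pos in enumerate(range(start, end)):
            wrong = head > 0 and pos in self.hard
            # Assumed, perfectly calibrated confidences: low exactly when wrong.
            drafted.append(("<bad>", 0.3) if wrong else (self.target[pos], 0.95))
        return drafted

    def verify(self, prefix, tokens):
        """One forward pass: next-token predictions at each drafted position."""
        self.passes += 1
        start = len(prefix)
        return self.target[start:start + len(tokens)]


def accept_all(model, prefix, drafted):
    """Naive multi-token prediction: keep everything, no reliability check."""
    return [tok for tok, _ in drafted]


def accept_verified(model, prefix, drafted):
    """SSD-style: keep drafted tokens up to the first one that fails verification."""
    tokens = [tok for tok, _ in drafted]
    kept = []
    for tok, ref in zip(tokens, model.verify(prefix, tokens)):
        if tok != ref:
            break
        kept.append(tok)
    return kept


def accept_confident(model, prefix, drafted, threshold=0.9):
    """CGD-style: keep the first token, then stop at the first low-confidence one."""
    kept = [drafted[0][0]] if drafted else []
    for tok, conf in drafted[1:]:
        if conf < threshold:
            break
        kept.append(tok)
    return kept


def decode(model, accept, stop="STOP", max_len=64):
    out = []
    while len(out) < max_len and (not out or out[-1] != stop):
        kept = accept(model, out, model.draft(out))
        if not kept:
            break  # nothing accepted; avoid looping forever
        out += kept
    return out


target = [f"t{i}" for i in range(20)] + ["STOP"]
policies = [("naive", accept_all), ("ssd", accept_verified), ("cgd", accept_confident)]
results = {}
for name, policy in policies:
    model = ToyModel(target, num_heads=7, hard_positions=[5, 12])
    output = decode(model, policy)
    results[name] = (output == target, model.passes)
    print(name, "correct:", results[name][0], "passes:", results[name][1])
```

In this toy setting the naive policy finishes in the fewest passes but emits wrong tokens, while both reliability-aware policies recover the exact next-token output in far fewer than 21 passes. The confidence policy looks cheapest here only because the toy confidences are perfectly calibrated, which a real model would not guarantee.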

In this example, Fast SceneScript produces results that are comparable to SceneScript using next-token prediction, while significantly outperforming SceneScript with naive multi-token prediction, which does not include any verification. This highlights the importance of reliability control when predicting multiple tokens in parallel.

Example comparison showing Fast SceneScript with reliability controls producing results comparable to next-token prediction and more accurate than naive multi-token prediction.

Experimental findings

Extensive evaluations on both synthetic and real-world benchmarks show that Fast SceneScript delivers substantial inference-time speedups for layout estimation and 3D object detection while matching or exceeding the accuracy of prior language-based approaches.

These results demonstrate that language-based 3D scene understanding does not need to trade accuracy for speed, and that structured decoding strategies play a critical role in enabling efficient inference.

Research implications

Fast SceneScript is intended to advance research on efficient structured language models for perception. The ideas presented in this work highlight the importance of decoding strategies when applying language models to structured, geometry-heavy tasks such as 3D scene understanding.

We hope this work will inspire further research into fast, unified perception models for XR, robotics, and spatial AI.

About the Authors

Theo Gevers
Ruihong Yin
Xuepeng Shi, Engineer, Senior at Qualcomm Technologies, Inc.
Alex Bailo, Engineer, Staff at Qualcomm Technologies, Inc.
Marco Manfredi, Senior Staff at Qualcomm Technologies, Inc.