Efficient and accurate 3D scene understanding remains a long-standing challenge in computer vision, especially for applications spanning XR, robotics, and spatial AI. Recent language-based perception models have shown that a single structured language architecture can support multiple 3D tasks, such as layout estimation and 3D object detection.
However, most existing language-based approaches rely on step-by-step token generation, which results in high inference latency and limits their practicality for real-time or large-scale use.
In this article, we present Fast SceneScript, a research contribution accepted to CVPR 2026, which explores how multi-token prediction can significantly accelerate inference while preserving accuracy for language-based 3D scene understanding.
This post summarizes the key research ideas, motivations, and findings behind Fast SceneScript.
Why language-based 3D scene understanding is slow
Language-based scene understanding represents 3D geometry and semantics as structured sequences of tokens following a predefined schema. This formulation enables strong flexibility: the same model can be applied to different 3D perception tasks simply by changing the output token structure.
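To make the token formulation concrete, here is a minimal Python sketch of how scene entities might be flattened into one structured sequence. The command names (`make_wall`, `make_bbox`), the parameter order, the `<start>`/`<stop>` markers, and the 0.1 m quantization step are all illustrative assumptions, not the schema used by Fast SceneScript.

```python
def serialize(entities, resolution=0.1):
    """Flatten (command, params) pairs into a single token list.

    Hypothetical schema for illustration only: each entity becomes a
    command token followed by its parameters quantized to integer bins.
    """
    tokens = ["<start>"]
    for command, params in entities:
        tokens.append(command)
        # Quantize each continuous parameter to the nearest bin index.
        tokens.extend(str(round(value / resolution)) for value in params)
    tokens.append("<stop>")
    return tokens


# Layout estimation: one wall with assumed parameters (x0, y0, x1, y1, height).
layout = [("make_wall", [0.0, 0.0, 4.0, 0.0, 2.7])]
# 3D object detection: one box with assumed parameters (cx, cy, cz, w, d, h).
detection = [("make_bbox", [1.2, 0.8, 0.5, 0.6, 0.6, 1.0])]

layout_tokens = serialize(layout)
detection_tokens = serialize(detection)
```

Both tasks go through the same `serialize` function; only the commands and their parameter lists differ, which is the flexibility the paragraph above describes.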
The main drawback is efficiency. Existing methods typically generate one token at a time, which causes inference cost to scale linearly with sequence length. As scene complexity increases, this token-by-token decoding quickly becomes a bottleneck.
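The linear cost is easy to see in a toy decoding loop. The sketch below is not the SceneScript decoder: `toy_step` is a hypothetical stand-in that replays a fixed sequence, used only to count how many model calls next-token decoding needs.

```python
def decode_next_token(step_fn, max_len=64):
    """Greedy next-token decoding: one model call per generated token."""
    tokens, passes = ["<start>"], 0
    while tokens[-1] != "<stop>" and len(tokens) < max_len:
        tokens.append(step_fn(tokens))  # one forward pass yields one token
        passes += 1
    return tokens, passes


# Hypothetical target sequence; a real decoder would predict these tokens.
TARGET = ["<start>", "make_wall", "0", "0", "40", "0", "27", "<stop>"]


def toy_step(prefix):
    """Stand-in for a trained model: returns the next token of TARGET."""
    return TARGET[len(prefix)]


tokens, passes = decode_next_token(toy_step)
```

Here `passes` equals the number of generated tokens, so doubling the number of walls or boxes in a scene roughly doubles the number of decoder calls.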
Faster decoding with multi-token prediction
Fast SceneScript addresses this efficiency issue by predicting multiple structured tokens per decoding step, significantly reducing the number of required forward passes during inference. In panels (b) and (c) of the accompanying figure, the gray vertical lines mark individual decoder passes. While SceneScript requires 21 decoding iterations to generate the full sequence, Fast SceneScript produces the same output in just 3 iterations, a 7x reduction in decoder passes for this example.
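The effect on the number of decoder passes can be sketched with a toy loop. Everything here is an assumption made for illustration: `toy_step` replays a fixed 21-token sequence instead of running a model, every proposed token is accepted without verification, and the block size of 7 is chosen only because it reproduces the 21 versus 3 pass counts quoted above, not because it is the setting used in the paper.

```python
def decode_multi_token(step_fn, k, max_len=64):
    """Decode with up to k tokens proposed per model call."""
    tokens, passes = ["<start>"], 0
    while tokens[-1] != "<stop>" and len(tokens) < max_len:
        block = step_fn(tokens, k)  # one forward pass proposes up to k tokens
        passes += 1
        for tok in block:
            tokens.append(tok)
            if tok == "<stop>":
                break  # ignore anything proposed after the stop token
    return tokens, passes


# Hypothetical sequence with 21 tokens to generate after <start>.
TARGET = ["<start>"] + [f"t{i}" for i in range(20)] + ["<stop>"]


def toy_step(prefix, k):
    """Stand-in for a trained model: returns the next k tokens of TARGET."""
    return TARGET[len(prefix):len(prefix) + k]


seq_k1, passes_k1 = decode_multi_token(toy_step, k=1)  # next-token baseline
seq_k7, passes_k7 = decode_multi_token(toy_step, k=7)  # multi-token decoding
```

In this toy setting both runs return the same sequence, with 21 passes for `k=1` and 3 passes for `k=7`. A real model's parallel predictions are not guaranteed to be correct, which is why the acceptance step matters.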
Reliable multi-token decoding
While multi-token prediction can dramatically improve speed, it introduces the risk of inaccurate predictions when future tokens are generated in parallel. The core insight of Fast SceneScript is that multi-token prediction can be made reliable for structured 3D scenes by carefully selecting and accepting only trustworthy predictions during decoding. This design enables substantial speedups while maintaining the accuracy of traditional autoregressive models. The paper explores two alternative strategies:
- Self-speculative decoding (SSD), which verifies predicted tokens and retains only those that remain consistent
- Confidence-guided decoding (CGD), which estimates token reliability directly and stops decoding when predictions become uncertain
Both strategies are adapted to structured scene tokens, allowing Fast SceneScript to safely exploit multi-token prediction without introducing accuracy degradation.
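The two acceptance rules can be sketched in a few lines of Python. This is a simplified reading of the descriptions above, not the paper's implementation: the function names, the exact-match verification rule, and the 0.9 confidence threshold are assumptions chosen for illustration.

```python
def accept_ssd(draft, verified):
    """SSD-style rule: keep the longest prefix of the draft tokens that
    agrees with the tokens produced by a verification pass."""
    accepted = []
    for draft_tok, verified_tok in zip(draft, verified):
        if draft_tok != verified_tok:
            break  # first disagreement ends the accepted prefix
        accepted.append(draft_tok)
    return accepted


def accept_cgd(draft, confidences, threshold=0.9):
    """CGD-style rule: keep draft tokens until the first one whose
    estimated confidence falls below the (assumed) threshold."""
    accepted = []
    for tok, conf in zip(draft, confidences):
        if conf < threshold:
            break  # stop at the first uncertain token
        accepted.append(tok)
    return accepted


# Hypothetical draft of three tokens from one multi-token step.
draft = ["make_wall", "0", "41"]
kept_by_ssd = accept_ssd(draft, verified=["make_wall", "0", "40"])
kept_by_cgd = accept_cgd(draft, confidences=[0.99, 0.95, 0.40])
```

In both cases the third token is rejected and decoding would resume from the accepted prefix. A practical loop also has to guarantee progress when nothing is accepted, for instance by always keeping the first token of each step; that detail is left out of this sketch.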
In this example, Fast SceneScript produces results that are comparable to SceneScript using next-token prediction, while significantly outperforming SceneScript with naive multi-token prediction, which does not include any verification. This highlights the importance of reliability control when predicting multiple tokens in parallel.
Experimental findings
Extensive evaluations on both synthetic and real-world benchmarks show that Fast SceneScript delivers substantial inference-time speedups for layout estimation and 3D object detection while matching or exceeding the accuracy of prior language-based approaches.
These results demonstrate that language-based 3D scene understanding does not need to trade accuracy for speed, and that structured decoding strategies play a critical role in enabling efficient inference.
Research implications
Fast SceneScript is intended to advance research on efficient structured language models for perception. The ideas presented in this work highlight the importance of decoding strategies when applying language models to structured, geometry-heavy tasks such as 3D scene understanding.
We hope this work will inspire further research into fast, unified perception models for XR, robotics, and spatial AI.





