Introduction
Qualcomm AI Research presents the QIVD dataset for studying how well Vision-Language Models (VLMs) answer questions about events occurring in videos. Answering such questions is a long-standing goal in AI and a prerequisite for real-world AI assistants and humanoid robots that interact with humans in everyday situations.
The dataset is useful for AI systems that must interpret and respond to real-time visual and audio inputs, or that must interact with a user in the real world and understand what is occurring.
Our dataset comprises 2900 data items, where each item contains:
- A video with audio of a user asking a question about the video.
- A transcription of a human-generated question asked during the recording, and a human-generated answer to that question.
- A ground-truth timestamp that marks when it's appropriate to answer the question.
Categories
Each item is assigned to one of 13 predefined semantic categories representing distinct visual reasoning capabilities:
| Category | Description |
|---|---|
| Action Attribute | Inquiries about the manner of a performed action, such as "Which hand did I use to wave?" or "How fast did I jump?". Tests the ability to recognize fine-grained characteristics of dynamic events. |
| Action Counting | Questions about an action's repetition frequency, such as "How many times did I clap?". Evaluates temporal reasoning and event segmentation capabilities. |
| Action Detection | Identifies the specific action performed, such as "What am I doing right now?". Assesses basic activity recognition in dynamic scenes. |
| Action Understanding | Questions the purpose or outcome of an action, such as "What does this gesture mean?" or "Why am I moving the chair?". Tests higher-level action interpretation and intention recognition. |
| Object Attributes | Inquiries about an object's characteristics, such as "What color is this book?" or "Is this cup empty or full?". Evaluates fine-grained visual perception of static properties. |
| Object Counting | Determines the number of objects present, such as "How many pens are on the table?". Tests quantitative reasoning and object individuation. |
| Object Detection | Identifies scene objects, such as "Is there a lamp in this room?". Assesses basic object recognition capabilities. |
| Object Referencing | Indirectly points to an object within the scene, such as "What am I pointing at?" or "What is behind me?". Evaluates spatial reasoning and deictic reference resolution. |
| Object Understanding | Questions about an object's nature or function, such as "What is this tool used for?". Tests semantic knowledge about objects beyond mere recognition. |
| Scene Understanding | Inquiries about the environment, such as "What room am I in?" or "Is it daytime or nighttime?". Evaluates holistic scene interpretation. |
| Audio-Visual | Questions that require audio information for a complete answer, such as "What sound am I making?" or "Am I speaking loudly or softly?". Tests cross-modal integration capabilities. |
| OCR | Extracts text from an object, such as "What does this sign say?". Evaluates the capability to recognize text in the real world and within the context of the conversation. |
| Subjective | Solicits general opinions about an object or scene, such as "Does this outfit look good?". Tests a model's ability to respond sensibly to subjective questions. |
Dataset Details
| Size | 2900 Videos |
| Average Length | 5.10 seconds |
| Total Number of Frames | 443350 |
| Average FPS | 30 |
| Average Image Resolution for Frames | 640 x 382.29 |
| Vocabulary Size | 3624 Words / 3072 Tokens |
| Total Semantic Categories | 13 |
| Average Question Length (words) | 6.09 |
| Average Answer Length (words) | 7.23 |
| Average Short Answer Length (words) | 1.38 |
| Average Answer Timestamp (%) | 81.47% |
| Total Questions | 1661 |
| Questions with “where” | 47 |
| Questions with “how” | 512 |
| Questions with “what” | 1102 |
| Total Deictic References | 789 |
| Questions with “here” | 32 |
| Questions with “these” | 39 |
| Questions with “that” | 45 |
| Questions with “there” | 105 |
| Questions with “this” | 568 |
| Average Answer Timestamp / Total Samples (all categories) | 81.47% / 2900 |
| Action Attributes | 84.31% / 155 |
| Action Counting | 92.22% / 225 |
| Action Detection | 85.46% / 440 |
| Action Understanding | 81.47% / 110 |
| Object Attributes | 79.52% / 562 |
| Object Counting | 78.41% / 286 |
| Object Detection | 76.95% / 211 |
| Object Referencing | 79.18% / 706 |
| Object Understanding | 80.63% / 79 |
| Scene Understanding | 79.91% / 38 |
| Audio-Visual | 90.09% / 22 |
| OCR | 83.04% / 23 |
| Subjective | 77.39% / 43 |
| Source | All videos in the benchmark were crowdsourced and then annotated by non-expert annotators. The videos then underwent a rigorous quality check and a semantic categorization process. |
| Language | English |
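The figures in this table can be cross-checked against each other. The sketch below assumes that the per-category figures pair up as (average answer timestamp, number of samples) in the order the categories are listed; the variable names are illustrative only. It checks that the per-category sample counts add up to the dataset size, that the sample-weighted mean of the per-category timestamps reproduces the overall 81.47%, and that the frame count matches the stated average length at 30 FPS.

```python
# (average answer timestamp in %, number of samples) per category.
# Assumption: values are paired in the order the categories are listed.
per_category = {
    "Action Attributes": (84.31, 155),
    "Action Counting": (92.22, 225),
    "Action Detection": (85.46, 440),
    "Action Understanding": (81.47, 110),
    "Object Attributes": (79.52, 562),
    "Object Counting": (78.41, 286),
    "Object Detection": (76.95, 211),
    "Object Referencing": (79.18, 706),
    "Object Understanding": (80.63, 79),
    "Scene Understanding": (79.91, 38),
    "Audio-Visual": (90.09, 22),
    "OCR": (83.04, 23),
    "Subjective": (77.39, 43),
}

# Per-category sample counts should add up to the dataset size.
total_samples = sum(n for _, n in per_category.values())

# Sample-weighted mean of the per-category answer timestamps.
weighted_avg = sum(p * n for p, n in per_category.values()) / total_samples

# Average clip length implied by the total frame count at 30 FPS.
avg_seconds = 443350 / 2900 / 30

print(total_samples)           # 2900
print(round(weighted_avg, 2))  # 81.47
print(round(avg_seconds, 2))   # 5.1
```

All three values agree with the table, which supports reading the per-category figures as (timestamp, samples) pairs.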
Data Format
The dataset is distributed in two files:
- videos.zip: 2,900 video files in .mp4 format.
- annotations.zip: A single .json file with all video annotations. The file is structured as a list of 2,900 dictionary entries, each corresponding to one video and including the following metadata and labels:
  - video: Video filename (e.g., "00000000.mp4").
  - question: Question asked in the video (e.g., "What am I holding in my left hand?").
  - answer: Complete answer to the question (e.g., "You are holding a Rubik's cube in your left hand.").
  - short_answer: Concise version of the answer with only the key information (e.g., "A Rubik's cube").
  - timestamp: Optimal time in the video to answer the question (e.g., "00:04.4").
  - category: Category of the question (e.g., "object referencing").
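As a minimal sketch of how these annotations might be read in Python: the example assumes the timestamp always follows the "MM:SS.s" pattern of the sample value above, and it uses a hypothetical filename, annotations.json, because the name of the file inside annotations.zip is not stated here. The helpers `parse_timestamp` and `load_annotations` are illustrative and not part of any official tooling. To stay runnable without the download, the sketch writes one entry built from the example values to a temporary file and reads it back.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def parse_timestamp(ts: str) -> float:
    """Convert an "MM:SS.s" string such as "00:04.4" into seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)


def load_annotations(json_path: Path) -> list[dict]:
    """Read the annotation list and add a numeric `timestamp_s` field."""
    with open(json_path, encoding="utf-8") as f:
        entries = json.load(f)
    return [dict(e, timestamp_s=parse_timestamp(e["timestamp"])) for e in entries]


# One entry assembled from the example values listed for each field.
sample = {
    "video": "00000000.mp4",
    "question": "What am I holding in my left hand?",
    "answer": "You are holding a Rubik's cube in your left hand.",
    "short_answer": "A Rubik's cube",
    "timestamp": "00:04.4",
    "category": "object referencing",
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "annotations.json"  # hypothetical filename
    path.write_text(json.dumps([sample]), encoding="utf-8")
    entries = load_annotations(path)

# Count questions per category, e.g. to compare against the dataset statistics.
per_category_counts = Counter(e["category"] for e in entries)
print(entries[0]["timestamp_s"])
print(per_category_counts)
```

With the real file, the same `load_annotations` call should return 2,900 entries, and the counter can be compared with the per-category sample totals.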
Dataset Citation Instructions
The dataset is intended for research purposes only. Please cite our paper if you use this dataset in your research.
@misc{pourreza2025visionlanguagemodelsanswerface,
title={Can Vision-Language Models Answer Face to Face Questions in the Real-World?},
author={Reza Pourreza and Rishit Dagli and Apratim Bhattacharyya and Sunny Panchal and Guillaume Berger and Roland Memisevic},
year={2025},
eprint={2503.19356},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.19356},
}
Dataset license
This dataset is intended for research purposes only and to support and contribute to the graph research community. The quality of the configuration space design and the collected execution times may be suboptimal and should not be considered as reference performances of the target device but rather as representative of the problem at hand for research purposes.
Qualcomm AI Research
At Qualcomm AI Research, we are advancing AI to make its core capabilities – perception, reasoning, and action – ubiquitous across devices. Our mission is to make breakthroughs in fundamental AI research and scale them across industries. By bringing together some of the best minds in the field, we’re pushing the boundaries of what’s possible and shaping the future of AI.
Qualcomm AI Research continues to invest in and support deep-learning research in computer vision. The publication of the AirLetters dataset for use by the AI research community is one of our many initiatives.
Find out more about Qualcomm AI Research.
For any questions or technical support, please contact us at [email protected]
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
| Scene | Split | Segments | Frames Per Segment | Total Frames | Includes Stops |
|---|---|---|---|---|---|
| FloodedGrounds | Train | 21 | 30 | 630 | No |
| SciFifacility | Train | 21 | 30 | 630 | No |
| Sun Temple | Train | 36 | 30 | 1080 | No |
| CB-Apocalypse | Train | 30 | 30 | 900 | No |
| SciFiBase | Train | 41 | 30 | 1230 | No |
| FloodedGrounds Bridges | Train | 20 | 30 | 600 | Yes |
| SciFiBaseNightStartStop | Train | 20 | 30 | 600 | Yes |
| SciFiBaseStartStop | Train | 21 | 30 | 630 | Yes |
| Sun TempleBush | Train | 11 | 30 | 330 | Yes |
| Sun TempleLamps | Train | 21 | 30 | 630 | Yes |
| AbandonedSchool | Test | 2 | 300 | 600 | Yes |
| SpaceShipDemo | Test | 2 | 300 | 600 | Yes |
| Seaport2 | Test | 1 | 300 | 300 | Yes |
| Total Training | | 242 | 30 | 7260 | |
| Total Test | | 5 | 300 | 1500 | |
| Total | | - | - | 8760 | |
| Scene | Assets Used | Source |
|---|---|---|
| CB-Apocalypse | CBU: Apocalypse Edition | Unity Asset Store |
| | ULTIMATE ANIMATION COLLECTION | Unity Asset Store |
| | Animal Pack Deluxe | Unity Asset Store |
| | Customizable Survivors Pack | Unity Asset Store |
| SciFifacility | Sci-Fi Facility | Unity Asset Store |
| | Robot Warriors Cartoon | Unity Asset Store |
| FloodedGrounds, FloodedGrounds Bridges | Flooded Grounds | Unity Asset Store |
| | Ghoul-zombie | Unity Asset Store |
| | Zombie | Unity Asset Store |
| | Fantastic Creature #1 | Unity Asset Store |
| SciFiBase, SciFiBaseNightStartStop, SciFiBaseStartStop | Sci-Fi base | Unity Asset Store |
| | Robot 1 | Unity Asset Store |
| | Robot Warriors Cartoon | Unity Asset Store |
| | Robot Sphere | Unity Asset Store |
| Sun Temple | Sun Temple | Unity Asset Store |
| | Animal Pack Deluxe | Unity Asset Store |
| | Dragon for Boss Monster: HP | Unity Asset Store |
| Sun TempleBush | Sun Temple | Unity Asset Store |
| | Real Landscapes • Valley Forest | Unity Asset Store |
| Sun TempleLamps | Sun Temple | Unity Asset Store |
| | Dragon for Boss Monster: HP | Unity Asset Store |
| | Flooded Grounds | Unity Asset Store |
| AbandonedSchool | Animal Pack Deluxe | Unity Asset Store |
| | HQ Abandoned School (Modular) | Unity Asset Store |
| SpaceShipDemo | Space Ship Demo | https://github.com/Unity-Technologies/SpaceshipDemo |
| Seaport2 | Old Sea Port | Unity Asset Store |
| | Fantasy Monster - Skeleton | Unity Asset Store |
| | Dungeon Skeletons Demo | Unity Asset Store |
