Qualcomm Interactive Video Dataset (QIVD)

Qualcomm AI Research presents the QIVD dataset for studying how well Vision-Language Models (VLMs) answer questions about events occurring in videos. Answering such questions is a long-standing goal in AI and a prerequisite for real-world AI assistants and humanoid robots that interact with humans in everyday situations.

The dataset is useful for AI systems that must interpret and respond to real-time visual and audio inputs, or that must interact with a user in the real world and understand what is occurring.

Our dataset comprises 2900 data items, where each item contains:

  • A video with audio of a user asking a question about the video.
  • A transcription of a human-generated question asked during the recording, and a human-generated answer to that question.
  • A ground-truth timestamp that marks when it's appropriate to answer the question.

Categories

Each item is assigned to one of 13 predefined semantic categories representing distinct visual reasoning capabilities:


  • Action Attributes: Inquiries about the manner of a performed action, such as "Which hand did I use to wave?" or "How fast did I jump?". Tests the ability to recognize fine-grained characteristics of dynamic events.
  • Action Counting: Questions about an action's repetition frequency, such as "How many times did I clap?". Evaluates temporal reasoning and event segmentation capabilities.
  • Action Detection: Identifies the specific action performed, such as "What am I doing right now?". Assesses basic activity recognition in dynamic scenes.
  • Action Understanding: Questions about the purpose or outcome of an action, such as "What does this gesture mean?" or "Why am I moving the chair?". Tests higher-level action interpretation and intention recognition.
  • Object Attributes: Inquiries about an object's characteristics, such as "What color is this book?" or "Is this cup empty or full?". Evaluates fine-grained visual perception of static properties.
  • Object Counting: Determines the number of objects present, such as "How many pens are on the table?". Tests quantitative reasoning and object individuation.
  • Object Detection: Identifies scene objects, such as "Is there a lamp in this room?". Assesses basic object recognition capabilities.
  • Object Referencing: Indirectly points to an object within the scene, such as "What am I pointing at?" or "What is behind me?". Evaluates spatial reasoning and deictic reference resolution.
  • Object Understanding: Questions about an object's nature or function, such as "What is this tool used for?". Tests semantic knowledge about objects beyond mere recognition.
  • Scene Understanding: Inquiries about the environment, such as "What room am I in?" or "Is it daytime or nighttime?". Evaluates holistic scene interpretation.
  • Audio-Visual: Questions that require audio information for a complete answer, such as "What sound am I making?" or "Am I speaking loudly or softly?". Tests cross-modal integration capabilities.
  • OCR: Extracts text from an object, such as "What does this sign say?". Evaluates the capability to recognize text in the real world and within the context of the conversation.
  • Subjective: Solicits general opinions about an object or scene, such as "Does this outfit look good?". Tests a model's ability to respond sensibly to subjective questions.

Dataset Details

Size 2900 Videos
Average Length 5.10 seconds
Total Number of Frames 443350
Average FPS 30
Average Image Resolution for Frames 640 x 382.29
Vocabulary Size 3624 Words / 3072 Tokens
Total Semantic Categories 13
Average Question Length (words) 6.09
Average Answer Length (words) 7.23
Average Short Answer Length (words) 1.38
Average Answer Timestamp (%) 81.47%
Total Questions 1661
  • Questions with “where”: 47
  • Questions with “how”: 512
  • Questions with “what”: 1102

Total Deictic References 789
  • Questions with “here”: 32
  • Questions with “these”: 39
  • Questions with “that”: 45
  • Questions with “there”: 105
  • Questions with “this”: 568

Average Answer Timestamp and Total Samples

| Category | Average Answer Timestamp | Total Samples |
|---|---|---|
| All categories | 81.47% | 2900 |
| Action Attributes | 84.31% | 155 |
| Action Counting | 92.22% | 225 |
| Action Detection | 85.46% | 440 |
| Action Understanding | 81.47% | 110 |
| Object Attributes | 79.52% | 562 |
| Object Counting | 78.41% | 286 |
| Object Detection | 76.95% | 211 |
| Object Referencing | 79.18% | 706 |
| Object Understanding | 80.63% | 79 |
| Scene Understanding | 79.91% | 38 |
| Audio-Visual | 90.09% | 22 |
| OCR | 83.04% | 23 |
| Subjective | 77.39% | 43 |
Source All videos in the benchmark were crowdsourced and then annotated by non-expert annotators. The videos then went through a rigorous quality check and a semantic categorization process.
Language English
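
The per-category figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes the percentages and sample counts pair up with the category labels in the order they are listed; the dictionary name and helper functions are illustrative and not part of the dataset. Under that assumption, the sample counts add up to the dataset size and the sample-weighted mean of the per-category timestamps comes out close to the overall 81.47%. The source does not say how the overall value was computed, so treat this as a consistency check only.

```python
# Per-category (average answer timestamp in %, number of samples).
# Assumption: values are paired with category labels in listing order.
CATEGORY_STATS = {
    "Action Attributes": (84.31, 155),
    "Action Counting": (92.22, 225),
    "Action Detection": (85.46, 440),
    "Action Understanding": (81.47, 110),
    "Object Attributes": (79.52, 562),
    "Object Counting": (78.41, 286),
    "Object Detection": (76.95, 211),
    "Object Referencing": (79.18, 706),
    "Object Understanding": (80.63, 79),
    "Scene Understanding": (79.91, 38),
    "Audio-Visual": (90.09, 22),
    "OCR": (83.04, 23),
    "Subjective": (77.39, 43),
}


def total_samples(stats):
    """Sum the sample counts over all categories."""
    return sum(count for _, count in stats.values())


def weighted_mean_timestamp(stats):
    """Sample-weighted mean of the per-category answer timestamps (in %)."""
    weighted_sum = sum(pct * count for pct, count in stats.values())
    return weighted_sum / total_samples(stats)


if __name__ == "__main__":
    print("total samples:", total_samples(CATEGORY_STATS))
    print("weighted mean timestamp (%):",
          round(weighted_mean_timestamp(CATEGORY_STATS), 2))
```

A mismatch in either number would suggest that a value was paired with the wrong category.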

Data Format

The dataset is distributed in two files:

  • videos.zip: 2,900 video files in .mp4 format.
  • annotations.zip: A single .json file with all video annotations. The file is structured as a list of 2,900 dictionary entries, each of which corresponds to one video and includes the following metadata and labels:
    • video: Video filename (e.g., "00000000.mp4").
    • question: Question asked in the video (e.g., "What am I holding in my left hand?").
    • answer: Complete answer to the question (e.g., "You are holding a Rubik's cube in your left hand.").
    • short_answer: Concise version of the answer with only the key information (e.g., "A Rubik's cube").
    • timestamp: Optimal time in the video to answer the question (e.g., "00:04.4").
    • category: Category of the question (e.g., "object referencing").
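
The annotation layout described above can be read with the Python standard library. This is a minimal sketch under stated assumptions: the name of the .json file inside annotations.zip is not given here, so the path is left to the caller, and the timestamp format is assumed to be "MM:SS.s" based on the single example "00:04.4". The helper names (`timestamp_to_seconds`, `add_timestamp_seconds`, `load_annotations`, `count_by_category`) are hypothetical and not part of the dataset.

```python
import json
from collections import Counter


def timestamp_to_seconds(timestamp: str) -> float:
    """Convert an assumed "MM:SS.s" string such as "00:04.4" to seconds."""
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + float(seconds)


def add_timestamp_seconds(entries: list[dict]) -> list[dict]:
    """Return copies of the entries with an extra numeric 'timestamp_s' field."""
    return [
        {**entry, "timestamp_s": timestamp_to_seconds(entry["timestamp"])}
        for entry in entries
    ]


def load_annotations(json_path: str) -> list[dict]:
    """Load the annotation list from the extracted .json file (path is caller-supplied)."""
    with open(json_path, encoding="utf-8") as handle:
        entries = json.load(handle)
    return add_timestamp_seconds(entries)


def count_by_category(entries: list[dict]) -> Counter:
    """Count how many entries fall into each question category."""
    return Counter(entry["category"] for entry in entries)


if __name__ == "__main__":
    # One entry built from the example values listed in the field descriptions.
    example = {
        "video": "00000000.mp4",
        "question": "What am I holding in my left hand?",
        "answer": "You are holding a Rubik's cube in your left hand.",
        "short_answer": "A Rubik's cube",
        "timestamp": "00:04.4",
        "category": "object referencing",
    }
    entries = add_timestamp_seconds([example])
    print(entries[0]["timestamp_s"])
    print(count_by_category(entries))
```

Keeping the original `timestamp` string alongside the derived `timestamp_s` value makes it easy to spot entries that do not follow the assumed format.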

Dataset Citation Instructions

The dataset is intended for research purposes only. Please cite our paper if you use this dataset in your research.

 

@misc{pourreza2025visionlanguagemodelsanswerface,
    title={Can Vision-Language Models Answer Face to Face Questions in the Real-World?},
    author={Reza Pourreza and Rishit Dagli and Apratim Bhattacharyya and Sunny Panchal and Guillaume Berger and Roland Memisevic},
    year={2025},
    eprint={2503.19356},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2503.19356},
}

Dataset license

This dataset is intended for research purposes only and to support and contribute to the graph research community. The quality of the configuration space design and the collected execution times may be suboptimal and should not be considered as reference performances of the target device but rather as representative of the problem at hand for research purposes.

 

Data License Agreement - Research Use

Qualcomm AI Research

At Qualcomm AI Research, we are advancing AI to make its core capabilities – perception, reasoning, and action – ubiquitous across devices. Our mission is to make breakthroughs in fundamental AI research and scale them across industries. By bringing together some of the best minds in the field, we’re pushing the boundaries of what’s possible and shaping the future of AI.
 

Qualcomm AI Research continues to invest in and support deep-learning research in computer vision. The publication of the AirLetters dataset for use by the AI research community is one of our many initiatives.

Find out more about Qualcomm AI Research.

For any questions or technical support, please contact us at [email protected]
 

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

| Scene | Split | Segments | Frames Per Segment | Total Frames | Includes stops |
|---|---|---|---|---|---|
| FloodedGrounds | Train | 21 | 30 | 630 | No |
| SciFifacility | Train | 21 | 30 | 630 | No |
| Sun Temple | Train | 36 | 30 | 1080 | No |
| CB-Apocalypse | Train | 30 | 30 | 900 | No |
| SciFiBase | Train | 41 | 30 | 1230 | No |
| FloodedGrounds Bridges | Train | 20 | 30 | 600 | Yes |
| SciFiBaseNightStartStop | Train | 20 | 30 | 600 | Yes |
| SciFiBaseStartStop | Train | 21 | 30 | 630 | Yes |
| Sun TempleBush | Train | 11 | 30 | 330 | Yes |
| Sun TempleLamps | Train | 21 | 30 | 630 | Yes |
| AbandonedSchool | Test | 2 | 300 | 600 | Yes |
| SpaceShipDemo | Test | 2 | 300 | 600 | Yes |
| Seaport2 | Test | 1 | 300 | 300 | Yes |
| Total Training | | 242 | 30 | 7260 | |
| Total Test | | 5 | 300 | 1500 | |
| Total | | - | - | 8760 | |

| Scene | Assets Used | Source |
|---|---|---|
| CB-Apocalypse | CBU: Apocalypse Edition | Unity Asset Store |
| | ULTIMATE ANIMATION COLLECTION | Unity Asset Store |
| | Animal Pack Deluxe | Unity Asset Store |
| | Customizable Survivors Pack | Unity Asset Store |
| SciFifacility | Sci-Fi Facility | Unity Asset Store |
| | Robot Warriors Cartoon | Unity Asset Store |
| FloodedGrounds, FloodedGrounds Bridges | Flooded Grounds | Unity Asset Store |
| | Ghoul-zombie | Unity Asset Store |
| | Zombie | Unity Asset Store |
| | Fantastic Creature #1 | Unity Asset Store |
| SciFiBase, SciFiBaseNightStartStop, SciFiBaseStartStop | Sci-Fi base | Unity Asset Store |
| | Robot 1 | Unity Asset Store |
| | Robot Warriors Cartoon | Unity Asset Store |
| | Robot Sphere | Unity Asset Store |
| Sun Temple | Sun Temple | Unity Asset Store |
| | Animal Pack Deluxe | Unity Asset Store |
| | Dragon for Boss Monster: HP | Unity Asset Store |
| Sun TempleBush | Sun Temple | Unity Asset Store |
| | Real Landscapes - Valley Forest | Unity Asset Store |
| Sun TempleLamps | Sun Temple | Unity Asset Store |
| | Dragon for Boss Monster: HP | Unity Asset Store |
| | Flooded Grounds | Unity Asset Store |
| AbandonedSchool | Animal Pack Deluxe | Unity Asset Store |
| | HQ Abandoned School (Modular) | Unity Asset Store |
| SpaceShipDemo | Space Ship Demo | https://github.com/Unity-Technologies/SpaceshipDemo |
| Seaport2 | Old Sea Port | Unity Asset Store |
| | Fantasy Monster - Skeleton | Unity Asset Store |
| | Dungeon Skeletons Demo | Unity Asset Store |

© Qualcomm Technologies, Inc. and/or its affiliated companies.
