Interactive Video Dataset for Vision-Language Models

Qualcomm Interactive Video Dataset (QIVD)

Introduction

Categories

Each item is assigned to one of 13 predefined semantic categories representing distinct visual reasoning capabilities:

Action Attribute	Inquiries about the manner of a performed action, such as “Which hand did I use to wave?” or “How fast did I jump?”; tests the ability to recognize fine-grained characteristics of dynamic events.
Action Counting	Questions about how often an action is repeated, such as “How many times did I clap?”; evaluates temporal reasoning and event segmentation capabilities.
Action Detection	Identifies the specific action performed, such as “What am I doing right now?”; assesses basic activity recognition in dynamic scenes.
Action Understanding	Questions about the purpose or outcome of an action, such as “What does this gesture mean?” or “Why am I moving the chair?”; tests higher-level action interpretation and intention recognition.
Object Attributes	Inquiries about an object’s characteristics, such as “What color is this book?” or “Is this cup empty or full?”; evaluates fine-grained visual perception of static properties.
Object Counting	Determines the number of objects present, such as “How many pens are on the table?”; tests quantitative reasoning and object individuation.
Object Detection	Identifies objects in the scene, such as “Is there a lamp in this room?”; assesses basic object recognition capabilities.
Object Referencing	Indirectly points to an object within the scene, such as “What am I pointing at?” or “What is behind me?”; evaluates spatial reasoning and deictic reference resolution.
Object Understanding	Questions about an object’s nature or function, such as “What is this tool used for?”; tests semantic knowledge about objects beyond mere recognition.
Scene Understanding	Inquiries about the environment, such as “What room am I in?” or “Is it daytime or nighttime?”; evaluates holistic scene interpretation.
Audio-Visual	Questions that require audio information for a complete answer, such as “What sound am I making?” or “Am I speaking loudly or softly?”; tests cross-modal integration capabilities.
OCR	Extracts text from an object, such as “What does this sign say?”; evaluates the ability to recognize text in the real world and within the context of the conversation.
Subjective	Solicits general opinions about an object or scene, such as “Does this outfit look good?”; tests a model’s ability to respond sensibly to subjective questions.
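
When tallying annotations by category, it can help to normalize labels first. The following is a minimal sketch, not part of the dataset tooling. It assumes that the `category` strings in the annotation file are lowercase forms of the display names above; only "object referencing" is confirmed by the documented example, so the exact spellings (including singular versus plural) should be checked against the actual file. The names `CATEGORIES`, `normalize_category`, and `count_by_category` are hypothetical.

```python
from collections import Counter

# Display names as listed in the category table above.
CATEGORIES = (
    "Action Attribute", "Action Counting", "Action Detection",
    "Action Understanding", "Object Attributes", "Object Counting",
    "Object Detection", "Object Referencing", "Object Understanding",
    "Scene Understanding", "Audio-Visual", "OCR", "Subjective",
)


def normalize_category(name: str) -> str:
    """Lowercase and collapse whitespace, e.g. "Object Referencing" -> "object referencing"."""
    return " ".join(name.lower().split())


def count_by_category(labels):
    """Count labels per normalized category; reject labels outside the 13 categories."""
    # Assumption: annotation labels match the display names after normalization.
    known = {normalize_category(c) for c in CATEGORIES}
    counts = Counter()
    for label in labels:
        key = normalize_category(label)
        if key not in known:
            raise ValueError(f"unknown category: {label!r}")
        counts[key] += 1
    return counts
```

Raising on unknown labels is a deliberate choice here: a spelling mismatch between this list and the annotation file then surfaces immediately instead of silently creating a fourteenth bucket.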

Dataset Details

Size	2900 Videos
Average Length	5.10 seconds
Total Number of Frames	443350
Average FPS	30
Average Image Resolution for Frames	640 x 382.29
Vocabulary Size	3624 Words / 3072 Tokens
Total Semantic Categories	13
Average Question Length (words)	6.09
Average Answer Length (words)	7.23
Average Short Answer Length (words)	1.38
Average Answer Timestamp (%)	81.47%
Total Questions	1661
Questions with “where”	47
Questions with “how”	512
Questions with “what”	1102
Total Deictic References	789
Questions with “here”	32
Questions with “these”	39
Questions with “that”	45
Questions with “there”	105
Questions with “this”	568
Average Answer Timestamp and Total Samples	81.47%	2900
Action Attributes	84.31%	155
Action Counting	92.22%	225
Action Detection	85.46%	440
Action Understanding	81.47%	110
Object Attributes	79.52%	562
Object Counting	78.41%	286
Object Detection	76.95%	211
Object Referencing	79.18%	706
Object Understanding	80.63%	79
Scene Understanding	79.91%	38
Audio-Visual	90.09%	22
OCR	83.04%	23
Subjective	77.39%	43
Source	All videos in the benchmark were crowdsourced and then annotated by non-expert annotators. The videos then went through a rigorous quality check and a semantic categorization process.
Language	English
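
The per-category figures in the table above can be cross-checked against the overall row with a few lines of arithmetic. The sketch below copies the 13 per-category values from the table and checks two things: that the sample counts add up to 2900, and that a sample-weighted mean of the per-category timestamps gives the overall 81.47%. The page does not state how the overall figure was computed, so the weighted mean is an assumption; the names `PER_CATEGORY`, `total_samples`, and `weighted_mean_timestamp` are illustrative.

```python
# (category, average answer timestamp in %, number of samples), copied from the table.
PER_CATEGORY = [
    ("Action Attributes", 84.31, 155),
    ("Action Counting", 92.22, 225),
    ("Action Detection", 85.46, 440),
    ("Action Understanding", 81.47, 110),
    ("Object Attributes", 79.52, 562),
    ("Object Counting", 78.41, 286),
    ("Object Detection", 76.95, 211),
    ("Object Referencing", 79.18, 706),
    ("Object Understanding", 80.63, 79),
    ("Scene Understanding", 79.91, 38),
    ("Audio-Visual", 90.09, 22),
    ("OCR", 83.04, 23),
    ("Subjective", 77.39, 43),
]


def total_samples(rows):
    """Sum the sample counts over all categories."""
    return sum(count for _, _, count in rows)


def weighted_mean_timestamp(rows):
    """Sample-weighted mean of the per-category average answer timestamps."""
    weighted_sum = sum(pct * count for _, pct, count in rows)
    return weighted_sum / total_samples(rows)


print(total_samples(PER_CATEGORY))                      # 2900
print(round(weighted_mean_timestamp(PER_CATEGORY), 2))  # 81.47
```

Both results agree with the overall row, which suggests the per-category values are internally consistent.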

Data Format

The dataset is distributed in two files:

videos.zip: 2,900 video files in .mp4 format.
annotations.zip: A single .json file containing all video annotations. The file is a list of 2,900 dictionary entries, one per video, each with the following fields:
- video: Video filename (e.g., "00000000.mp4").
- question: Question asked in the video (e.g., "What am I holding in my left hand?").
- answer: Complete answer to the question (e.g., "You are holding a Rubik's cube in your left hand.").
- short_answer: Concise version of the answer with only the key information (e.g., "A Rubik's cube").
- timestamp: Optimal time in the video to answer the question (e.g., "00:04.4").
- category: Category of the question (e.g., "object referencing").
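
The fields listed above can be read with the Python standard library. The following is a minimal sketch under stated assumptions. The name of the .json file inside annotations.zip is not given here, so the path is passed in by the caller. The timestamp is assumed to follow the "MM:SS.s" pattern of the documented example "00:04.4". `REQUIRED_KEYS`, `parse_timestamp`, and `load_annotations` are hypothetical helper names.

```python
import json
from pathlib import Path

# Field names as documented in the list above.
REQUIRED_KEYS = ("video", "question", "answer", "short_answer", "timestamp", "category")


def parse_timestamp(timestamp: str) -> float:
    """Convert an "MM:SS.s" string such as "00:04.4" into seconds."""
    # Assumption: exactly one colon separating minutes from (fractional) seconds.
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + float(seconds)


def load_annotations(json_path):
    """Read the annotation list and check that every entry has the documented keys."""
    entries = json.loads(Path(json_path).read_text(encoding="utf-8"))
    for index, entry in enumerate(entries):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"entry {index} is missing keys: {missing}")
    return entries
```

A typical use would be to call `load_annotations` on the extracted file, then pair each entry's `video` filename with the matching file from videos.zip and use `parse_timestamp(entry["timestamp"])` to locate the point at which the answer is expected.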

Dataset Citation Instructions

The dataset is intended for research purposes only. Please cite our paper if you use this dataset in your research.

@misc{pourreza2025visionlanguagemodelsanswerface,
    title={Can Vision-Language Models Answer Face to Face Questions in the Real-World?},
    author={Reza Pourreza and Rishit Dagli and Apratim Bhattacharyya and Sunny Panchal and Guillaume Berger and Roland Memisevic},
    year={2025},
    eprint={2503.19356},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2503.19356},
}

Dataset License

This dataset is intended for research purposes only and to support and contribute to the graph research community. The quality of the configuration space design and the collected execution times may be suboptimal and should not be considered as reference performances of the target device but rather as representative of the problem at hand for research purposes.

Data License Agreement - Research Use

Qualcomm AI Research

At Qualcomm AI Research, we are advancing AI to make its core capabilities – perception, reasoning, and action – ubiquitous across devices. Our mission is to make breakthroughs in fundamental AI research and scale them across industries. By bringing together some of the best minds in the field, we’re pushing the boundaries of what’s possible and shaping the future of AI.

Qualcomm AI Research continues to invest in and support deep-learning research in computer vision. The publication of the AirLetters dataset for use by the AI research community is one of our many initiatives.

Find out more about Qualcomm AI Research.

For any questions or technical support, please contact us at research.datasets@qti.qualcomm.com

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

Scene	Split	Segments	Frames per Segment	Total Frames	Includes Stops
FloodedGrounds	Train	21	30	630	No
SciFifacility	Train	21	30	630	No
Sun Temple	Train	36	30	1080	No
CB-Apocalypse	Train	30	30	900	No
SciFiBase	Train	41	30	1230	No
FloodedGrounds Bridges	Train	20	30	600	Yes
SciFiBaseNightStartStop	Train	20	30	600	Yes
SciFiBaseStartStop	Train	21	30	630	Yes
Sun TempleBush	Train	11	30	330	Yes
Sun TempleLamps	Train	21	30	630	Yes
AbandonedSchool	Test	2	300	600	Yes
SpaceShipDemo	Test	2	300	600	Yes
Seaport2	Test	1	300	300	Yes
Total Training		242	30	7260
Total Test		5	300	1500
Total		-	-	8760

Scene	Assets Used	Source
CB-Apocalypse	CBU: Apocalypse Edition	Unity Asset Store
	ULTIMATE ANIMATION COLLECTION	Unity Asset Store
	Animal Pack Deluxe	Unity Asset Store
	Customizable Survivors Pack	Unity Asset Store
SciFifacility	Sci-Fi Facility	Unity Asset Store
	Robot Warriors Cartoon	Unity Asset Store
FloodedGrounds, FloodedGrounds Bridges	Flooded Grounds	Unity Asset Store
	Ghoul-zombie	Unity Asset Store
	Zombie	Unity Asset Store
	Fantastic Creature #1	Unity Asset Store
SciFiBase, SciFiBaseNightStartStop, SciFiBaseStartStop	Sci-Fi base	Unity Asset Store
	Robot 1	Unity Asset Store
	Robot Warriors Cartoon	Unity Asset Store
	Robot Sphere	Unity Asset Store
Sun Temple	Sun Temple	Unity Asset Store
	Animal Pack Deluxe	Unity Asset Store
	Dragon for Boss Monster: HP	Unity Asset Store
Sun TempleBush	Sun Temple	Unity Asset Store
	Real Landscapes - Valley Forest	Unity Asset Store
Sun TempleLamps	Sun Temple	Unity Asset Store
	Dragon for Boss Monster: HP	Unity Asset Store
	Flooded Grounds	Unity Asset Store
AbandonedSchool	Animal Pack Deluxe	Unity Asset Store
	HQ Abandoned School (Modular)	Unity Asset Store
SpaceShipDemo	Space Ship Demo	https://github.com/Unity-Technologies/SpaceshipDemo
Seaport2	Old Sea Port	Unity Asset Store
	Fantasy Monster - Skeleton	Unity Asset Store
	Dungeon Skeletons Demo	Unity Asset Store