Abstract: Video question-answering is a fundamental task in the field of video understanding. Although current vision--language models (VLMs) equipped with Video Transformers have enabled temporal modeling and yielded superior results, they do so at the cost of huge computational power and are thus too expensive to deploy in real-time application scenarios. An economical workaround samples only a small portion of frames to represent the main content of the video and tunes an image--text model on these sampled frames. Recent video understanding models usually sample a set of frames or clips at random, regardless of the internal correlations among their visual contents or their relevance to the question. We argue that such aimless sampling may omit the key frames from which the correct answer can be deduced, and that the situation gets worse as the sampling sparsity increases, which always happens as videos grow longer. To mitigate this issue, we propose two frame sampling strategies, namely most dominant frames (MDF) and most implied frames (MIF), to maximally preserve the frames that are most likely vital to the given questions. MDF passively minimizes the risk of key frame omission in a bootstrap manner, while MIF actively searches for key frames customized for each video--question pair with the assistance of auxiliary models. Experimental results on three public datasets with three advanced VLMs (CLIP, GIT and All-in-one) demonstrate that the proposed strategies can boost the performance of image--text pretrained models. The source code for the proposed method is publicly available at https://github.com/declare-lab/sas-vqa.

Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models

Simple and Effective Visual Question Answering in a Single Modality

Adversarial Sample Synthesis for Visual Question Answering

End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

VidCtx: Context-aware Video Question Answering with Image Models

Video Question Generation for Dynamic Changes

Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering

Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

Triple Multimodal Cyclic Fusion and Self-Adaptive Balancing for Video Q&A Systems

Frame Augmented Alternating Attention Network for Video Question Answering

Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues

ViLA: Efficient Video-Language Alignment for Video Question Answering

Video Question Answering: a Survey of Models and Datasets

Video Question Answering with Semantic Disentanglement and Reasoning

Retrieval-based Video Language Model for Efficient Long Video Question Answering

An Adaptive Video Clip Sampling Approach for Enhancing Query-Based Moment Retrieval in Videos