Abstract: Significant advancements in video question answering (VideoQA) have been made thanks to thriving large image-language pretraining frameworks. Although these image-language models can efficiently represent both video and language branches, they typically employ a goal-free vision perception process and do not let vision interact well with language during answer generation, thus omitting crucial visual cues. In this paper, we are inspired by the human recognition and learning pattern and propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes. VideoDistill generates answers only from question-related visual embeddings and follows a thinking-observing-answering approach that closely resembles human behavior, distinguishing it from previous research. Specifically, we develop a language-aware gating mechanism to replace the standard cross-attention, avoiding language's direct fusion into visual representations. We incorporate this mechanism into two key components of the entire framework. The first component is a differentiable sparse sampling module, which selects frames containing the necessary dynamics and semantics relevant to the questions. The second component is a vision refinement module that merges existing spatial-temporal attention layers to ensure the extraction of multi-grained visual semantics associated with the questions. We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance on both general and long-form VideoQA datasets. In addition, we verify that VideoDistill can effectively alleviate the utilization of language shortcut solutions in the EgoTaskQA dataset.

What problem does this paper attempt to address?

This paper addresses several key problems in video question answering (VideoQA), most of which relate to long-video understanding:

1. **Long-term Dependence**: Goal-free video representations (those computed independently of the question) struggle with long-term dependencies. On long videos, existing methods include a large amount of information that is irrelevant or redundant to the question, which degrades overall understanding and incurs heavy computational cost.
2. **Multi-event Reasoning**: Goal-free methods perform poorly on videos containing multiple events, which require accurately understanding and correlating different time points in the video.
3. **Multi-scale Semantic Modeling**: Accurate semantic reasoning usually depends on multi-scale perception, from local spatial regions to global temporal dynamics. Existing goal-free methods achieve multi-scale visual modeling only with customized sub-models or additional modalities such as bounding boxes and OCR features, which makes them inefficient or infeasible for large-scale pre-training.
4. **Language Prior Phenomenon**: When training on question-answer pairs, a model is easily influenced by language priors: it predicts answers from obvious clues in the questions (mainly correlations between keywords and the answer distribution) rather than through visual reasoning. This phenomenon, called language bias, is particularly pronounced early in training and usually leads to a performance gap in out-of-distribution testing.

To solve these problems, the paper proposes **VideoDistill**, a language-aware (i.e., goal-driven) visual-semantic distillation framework. The framework improves VideoQA in the following ways:

- **Language-Aware Gate Mechanism (LA-Gate)**: A multi-head cross-gating mechanism handles cross-modal interaction while avoiding the direct fusion of language into visual representations. LA-Gate computes the dependence of the question on the video-block embeddings and suppresses or stimulates the corresponding blocks in the subsequent attention layers.
- **Differentiable Sparse Sampling Module**: Frames are encoded with a pre-trained image-language model (such as CLIP), and goal-driven frame sampling is then applied. This significantly reduces the overhead of the subsequent spatio-temporal attention and naturally sidesteps the long-term dependence and multi-event reasoning problems.
- **Visual Refinement Module**: Irrelevant visual semantics are eliminated and question-related multi-scale semantics are enhanced, with support for multi-level refinement. This module encodes the sparsely sampled frames into a single question-related global embedding used for answer generation.

With these components, VideoDistill achieves state-of-the-art performance on various VideoQA benchmarks, especially on long-video question answering. The paper also verifies that VideoDistill can effectively reduce the use of language shortcut solutions on the EgoTaskQA dataset.
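To make the gating and frame-selection ideas more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names (`language_aware_gate`, `soft_frame_weights`), the max-over-question-tokens dependence score, the sigmoid gate, and the softmax relaxation used as a stand-in for differentiable sampling are all hypothetical choices, since the summary above does not specify them. What the sketch does preserve is the stated property that language only scales visual features and is never added into them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def language_aware_gate(visual, question, num_heads=4):
    """Hypothetical multi-head gate (an assumption, not the paper's exact LA-Gate).

    visual:   (N, D) video-block embeddings
    question: (L, D) question token embeddings
    Returns the gated visual embeddings (N, D) and the gates (N, num_heads).
    """
    n, d = visual.shape
    assert d % num_heads == 0, "D must be divisible by num_heads"
    hd = d // num_heads
    v = visual.reshape(n, num_heads, hd)
    q = question.reshape(-1, num_heads, hd)
    # Per-head similarity between each visual block and each question token: (H, N, L)
    scores = np.einsum("nhd,lhd->hnl", v, q) / np.sqrt(hd)
    # Assumed dependence score: strongest match over question tokens, per head: (H, N)
    dependence = scores.max(axis=-1)
    gates = sigmoid(dependence).T  # (N, H), each value in (0, 1)
    # Language only rescales visual features; no language vector is mixed in.
    gated = v * gates[:, :, None]
    return gated.reshape(n, d), gates

def soft_frame_weights(frame_emb, question_emb, temperature=0.1):
    """Softmax relaxation of frame selection (a stand-in for differentiable sampling).

    frame_emb:    (T, D) per-frame embeddings, e.g. from an image-language encoder
    question_emb: (D,)   pooled question embedding
    Returns (T,) weights that sum to 1; a lower temperature gives a sparser selection.
    """
    f = frame_emb / np.linalg.norm(frame_emb, axis=-1, keepdims=True)
    q = question_emb / np.linalg.norm(question_emb)
    return softmax(f @ q / temperature)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(8, 16))    # 8 frames, 16-dim embeddings (toy sizes)
    question = rng.normal(size=(5, 16))  # 5 question tokens
    gated, gates = language_aware_gate(frames, question, num_heads=4)
    weights = soft_frame_weights(frames, question.mean(axis=0))
    top_k = np.argsort(weights)[::-1][:3]  # hard top-k selection at inference time
    print(gated.shape, gates.shape, top_k)
```

Because the gate multiplies rather than adds, downstream attention layers still operate on purely visual features whose magnitudes reflect question relevance, which is one plausible reading of "suppresses or stimulates the corresponding blocks."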

VideoDistill: Language-aware Vision Distillation for Video Question Answering

Language-aware Visual Semantic Distillation for Video Question Answering

Overcoming Language Priors In Vqa Via Decomposed Linguistic Representations

Language-Guided Visual Aggregation Network for Video Question Answering

A Video Question Answering Model Based on Knowledge Distillation.

Equivariant and Invariant Grounding for Video Question Answering

Video Question Answering with Semantic Disentanglement and Reasoning

Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation

Distilled Dual-Encoder Model for Vision-Language Understanding

Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering

Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA

LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

Unlocking Video-LLM via Agent-of-Thoughts Distillation

ViLA: Efficient Video-Language Alignment for Video Question Answering

Answering Diverse Questions via Text Attached with Key Audio-Visual Clues

Multichannel Attention Refinement for Video Question Answering.

Video Question Answering Via Gradually Refined Attention over Appearance and Motion

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Distilling Vision-Language Models on Millions of Videos

Video Question Answering Using CLIP-Guided Visual-Text Attention