Abstract: While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models capture the rich multimodal structures and dynamics from video and text jointly? Or are they achieving high scores by exploiting biases and spurious features? To provide insights, we design $\textit{QUAG}$ (QUadrant AveraGe), a lightweight and non-parametric probe, to conduct dataset-model combined representation analysis by impairing modality fusion. We find that the models achieve high performance on many datasets without leveraging multimodal representations. To validate QUAG further, we design $\textit{QUAG-attention}$, a less-expressive replacement of self-attention with restricted token interactions. Models with QUAG-attention achieve similar performance with significantly fewer multiplication operations without any finetuning. Our findings raise doubts about the current models' abilities to learn highly-coupled multimodal representations. Hence, we design the $\textit{CLAVI}$ (Complements in LAnguage and VIdeo) dataset, a stress-test dataset curated by augmenting real-world videos to have high modality coupling. Consistent with the findings of QUAG, we find that most of the models achieve near-trivial performance on CLAVI. This reasserts the limitations of current models in learning highly-coupled multimodal representations, a capability that is not evaluated by the current datasets (project page: https://dissect-videoqa.github.io).

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper examines whether current VideoQA (Video Question Answering) Transformer models genuinely perform cross-modal understanding and fusion. Specifically, it focuses on the following issues:

1. **Whether the models truly understand multimodal data**: Existing VideoQA Transformer models perform well on standard benchmarks, but the reasons for their success are not entirely clear. Do these models capture the rich multimodal structures and dynamics in video and text jointly, or do they achieve high scores by exploiting biases and spurious features in the datasets?
2. **The effectiveness of multimodal fusion**: To measure how much the models depend on multimodal fusion, the authors design the QUAG (QUadrant AveraGe) probe, which analyzes dataset-model combinations by impairing modality fusion. QUAG does this by block-averaging attention weights.
3. **Limitations of existing benchmarks**: Can current VideoQA benchmarks fully evaluate the multimodal understanding ability of models? Do biases in some datasets allow models to score well without actually learning multimodal fusion?
4. **Performance in a highly-coupled multimodal setting**: To test the models where the modalities are tightly coupled, the authors construct a new stress-test dataset, CLAVI (Complements in LAnguage and VIdeo), which ensures high modality coupling by augmenting real-world videos and generating temporal questions.

### Main contributions

1. **Designing the QUAG probe**: QUAG is a lightweight and non-parametric probe used to systematically evaluate the relative contributions of the multimodal components in the fusion stage of a model.
2. **Conducting experiments with QUAG and QUAG-attention**: After replacing self-attention with QUAG-attention, the authors find that the models still achieve similar performance without fine-tuning, which suggests that the models may be exploiting spurious features in the datasets rather than performing true multimodal fusion.
3. **Developing the CLAVI dataset**: CLAVI is a stress-test dataset with high modality coupling, used to evaluate models in a highly-coupled multimodal setting. Experimental results show that most models achieve near-trivial performance on CLAVI, further confirming the limitations of existing models in learning highly-coupled multimodal representations.

### Conclusion

Through QUAG and CLAVI, the paper shows that current VideoQA models have significant deficiencies in learning highly-coupled multimodal representations, an ability that current benchmarks do not systematically evaluate. This suggests that future multimodal learning research needs more comprehensive and stricter benchmarks to evaluate the multimodal understanding ability of models.
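The summary describes QUAG as block-averaging attention weights to impair modality fusion. The following is a minimal sketch of one plausible reading of that idea, not the authors' implementation. It assumes a row-stochastic attention matrix over concatenated tokens with video tokens first, and it replaces the values in one chosen quadrant with their row-wise mean inside that quadrant. The function name `quadrant_average`, its parameters, and the toy matrix are all hypothetical and chosen only for illustration.

```python
def quadrant_average(attn, n_video, rows, cols):
    """Return a copy of attn with one quadrant replaced by its row-wise mean.

    attn    : square list of lists of attention weights (assumed post-softmax).
    n_video : number of video tokens, assumed to precede the text tokens.
    rows    : "video" or "text", the query-side block to modify.
    cols    : "video" or "text", the key-side block to modify.
    """
    n = len(attn)
    if not 0 < n_video < n:
        raise ValueError("both modalities need at least one token")
    spans = {"video": range(0, n_video), "text": range(n_video, n)}

    out = [row[:] for row in attn]  # copy so the input is left untouched
    for i in spans[rows]:
        block = [attn[i][j] for j in spans[cols]]
        mean = sum(block) / len(block)
        for j in spans[cols]:
            out[i][j] = mean  # every key in the block gets the same weight
    return out


# Toy example: 2 video tokens followed by 2 text tokens.
attn = [
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.40, 0.10],
    [0.70, 0.10, 0.10, 0.10],
]

# Impair the video-to-text quadrant (video queries attending to text keys).
out = quadrant_average(attn, n_video=2, rows="video", cols="text")
```

Under this reading, averaging within a row leaves each row sum unchanged, so the matrix stays a valid attention distribution while the video queries can no longer distinguish individual text tokens. Comparing model accuracy before and after such an intervention is the kind of analysis the probe is described as enabling.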

Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion

Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models

Multimodal Analysis for Deep Video Understanding with Video Language Transformer

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

Robust video question answering via contrastive cross-modality representation learning

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

Modality attention fusion model with hybrid multi-head self-attention for video understanding

ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

Exploring Efficient Foundational Multi-modal Models for Video Summarization

Multi-Modal interpretable automatic video captioning

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering

CLIPVQA: Video Quality Assessment via CLIP

Contrastive Video Question Answering via Video Graph Transformer

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving