Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion

Ishaan Singh Rawal, Alexander Matyasko, Shantanu Jaiswal, Basura Fernando, Cheston Tan
2024-06-07
Abstract: While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models capture the rich multimodal structures and dynamics from video and text jointly? Or are they achieving high scores by exploiting biases and spurious features? Hence, to provide insights, we design $\textit{QUAG}$ (QUadrant AveraGe), a lightweight and non-parametric probe, to conduct dataset-model combined representation analysis by impairing modality fusion. We find that the models achieve high performance on many datasets without leveraging multimodal representations. To validate QUAG further, we design $\textit{QUAG-attention}$, a less-expressive replacement of self-attention with restricted token interactions. Models with QUAG-attention achieve similar performance with significantly fewer multiplication operations without any finetuning. Our findings raise doubts about the current models' abilities to learn highly-coupled multimodal representations. Hence, we design the $\textit{CLAVI}$ (Complements in LAnguage and VIdeo) dataset, a stress-test dataset curated by augmenting real-world videos to have high modality coupling. Consistent with the findings of QUAG, we find that most of the models achieve near-trivial performance on CLAVI. This reasserts the limitations of current models for learning highly-coupled multimodal representations, that is not evaluated by the current datasets (project page: https://dissect-videoqa.github.io).
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper examines whether current VideoQA (Video Question Answering) Transformer models genuinely understand and fuse information across modalities. Specifically, it focuses on the following issues:

1. **Whether the models truly understand multimodal data**: Existing VideoQA Transformer models perform well on standard benchmarks, but the reasons for their success are not entirely clear. Do these models capture the rich multimodal structures and dynamics of video and text jointly, or do they achieve high scores by exploiting biases and spurious features in the datasets?
2. **The effectiveness of multimodal fusion**: To measure how much a model depends on multimodal fusion, the authors design the QUAG (QUadrant AveraGe) probe, which analyzes the combined dataset-model representations by impairing modality fusion. QUAG does this by block-averaging the attention weights.
3. **Limitations of existing benchmarks**: Can current VideoQA benchmarks fully evaluate the multimodal understanding of models? Do biases in some datasets allow models to score well without actually learning multimodal fusion?
4. **Performance in a highly-coupled multimodal setting**: To test models where the modalities are tightly coupled, the authors construct a new stress-test dataset, CLAVI (Complements in LAnguage and VIdeo), which ensures high modality coupling by augmenting real-world videos and generating temporal questions.

### Main contributions

1. **Designing the QUAG probe**: QUAG is a lightweight, non-parametric probe for systematically evaluating the relative contributions of the multimodal components in a model's fusion stage.
2. **Conducting experiments with QUAG and QUAG-attention**: After replacing self-attention with QUAG-attention, the authors find that the models still achieve similar performance without fine-tuning, which indicates that the models may be exploiting spurious features in the datasets rather than performing true multimodal fusion.
3. **Developing the CLAVI dataset**: CLAVI is a stress-test dataset with high modality coupling, used to evaluate models in a highly-coupled multimodal setting. Experimental results show that most models achieve near-trivial performance on CLAVI, further confirming the limitations of existing models in learning highly-coupled multimodal representations.

### Conclusion

Through QUAG and CLAVI, the paper shows that current VideoQA models have significant deficiencies in learning highly-coupled multimodal representations, a key aspect that current benchmarks fail to evaluate systematically. This finding suggests that future multimodal learning research needs more comprehensive and stricter benchmarks to evaluate the multimodal understanding of models.
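The block-averaging idea behind the QUAG probe can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the fused token sequence is laid out as video tokens followed by text tokens, that the attention matrix is already softmax-normalized, and that "averaging a quadrant" means replacing each row of that quadrant with its row-wise mean. The function name `quadrant_average` and the quadrant labels (`"VV"`, `"VT"`, `"TV"`, `"TT"`, query modality first, key modality second) are hypothetical names introduced here.

```python
import numpy as np


def quadrant_average(attn, n_video, quadrants=("VT", "TV")):
    """Replace selected quadrants of an attention matrix with row-wise means.

    attn      : (n, n) row-stochastic attention matrix (assumed layout:
                first n_video tokens are video, the rest are text).
    n_video   : number of video tokens.
    quadrants : hypothetical labels; first letter is the query modality,
                second letter is the key modality.
    """
    attn = np.asarray(attn, dtype=float)
    n = attn.shape[-1]
    spans = {"V": slice(0, n_video), "T": slice(n_video, n)}
    out = attn.copy()
    for q in quadrants:
        rows, cols = spans[q[0]], spans[q[1]]
        # Row-wise mean inside the quadrant, broadcast back over its columns.
        out[rows, cols] = out[rows, cols].mean(axis=1, keepdims=True)
    return out


# Toy example: 3 video tokens + 2 text tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 5))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Impair only the cross-modal quadrants (video->text and text->video).
impaired = quadrant_average(attn, n_video=3, quadrants=("VT", "TV"))
```

Because each row of a quadrant is replaced by its own mean, the row sums of the attention matrix are unchanged, so the output remains a valid attention distribution while the fine-grained token-to-token interactions inside the averaged quadrants are removed. Comparing task accuracy with and without this impairment is the kind of analysis the summary above describes.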