Read, Look or Listen? What's Needed for Solving a Multimodal Dataset

Netta Madvil, Yonatan Bitton, Roy Schwartz

2023-07-06

Abstract: The prevalence of large-scale multimodal datasets presents unique challenges in assessing dataset quality. We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it. Our method sheds light on the importance of different modalities in datasets, as well as the relationship between them. We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality. Moreover, we find that more than 70% of the questions are solvable using several different single-modality strategies, e.g., by either looking at the video or listening to the audio, highlighting the limited integration of multiple modalities in TVQA. We leverage our annotation to analyze MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification. Based on our observations, we introduce a new test set that necessitates multiple modalities, observing a dramatic drop in model performance. Our methodology provides valuable insights into multimodal datasets and highlights the need for the development of more robust models.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses quality assessment for multimodal datasets: how to determine which modalities a dataset actually relies on and how those modalities relate to one another. It proposes a two-step method that uses a small amount of manual annotation to determine the modalities required for each instance, thereby revealing the importance of different modalities in the dataset and the relationships between them. The method both evaluates how different modalities affect model performance and uncovers biases and shortcomings in the dataset.

The main contributions of the paper include:
1. Proposing a new two-step method to map instances in a multimodal dataset to the specific modalities required to process them.
2. Analyzing the importance and representation of modalities in the extended TVQA dataset, providing in-depth insights into the characteristics of the dataset.
3. Evaluating the capabilities and biases of the MERLOT Reserve model on the TVQA dataset, revealing how the model performs on different modalities.
4. Creating a new test set of questions that require multiple modalities to answer, further validating the limitations of existing models on multimodal problems.

Through these contributions, the paper aims to promote a deeper understanding and analysis of multimodal datasets and to drive the development of more robust multimodal models.
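The kind of dataset-level statistics described above (the share of questions answerable from a single modality, and the share answerable by several different single-modality strategies) can be illustrated with a small tally over per-question annotations. The following is a minimal sketch, not the authors' pipeline: the annotation format, the modality names `text`, `video`, and `audio`, the function `summarize_modalities`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical annotation format (an assumption, not taken from the paper):
# each question maps to a list of "strategies", where a strategy is the set
# of modalities that together suffice to answer the question.
toy_annotations = {
    "q1": [{"video"}, {"audio"}],            # look OR listen
    "q2": [{"text"}],                        # read only
    "q3": [{"video", "audio"}],              # needs both together
    "q4": [{"text"}, {"audio"}, {"video"}],  # any one of three
}


def summarize_modalities(annotations):
    """Tally how many questions are solvable from one modality alone."""
    if not annotations:
        raise ValueError("annotations must not be empty")

    total = len(annotations)
    single = 0          # at least one single-modality strategy works
    several = 0         # two or more different single-modality strategies work
    multi_only = 0      # every strategy needs more than one modality
    per_modality = Counter()

    for strategies in annotations.values():
        # Collect the modalities that suffice on their own.
        alone = {next(iter(s)) for s in strategies if len(s) == 1}
        if alone:
            single += 1
            per_modality.update(alone)
            if len(alone) > 1:
                several += 1
        else:
            multi_only += 1

    return {
        "single_modality": single / total,
        "several_single_strategies": several / total,
        "requires_multiple": multi_only / total,
        "per_modality": dict(per_modality),
    }


summary = summarize_modalities(toy_annotations)
print(summary)
```

On real annotations, a high `several_single_strategies` value would correspond to the paper's observation that many TVQA questions can be solved in more than one single-modality way, while the `requires_multiple` subset is the kind of question a multimodality-requiring test set would target.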

Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models

ManyModalQA: Modality Disambiguation and QA over Diverse Inputs

Perceptual Score: What Data Modalities Does Your Model Perceive?

VisualHow: Multimodal Problem Solving

Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Exploring Missing Modality in Multimodal Egocentric Datasets

TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models

MMR: Evaluating Reading Ability of Large Multimodal Models

Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

Towards Building Large Scale Multimodal Domain-Aware Conversation Systems

Enhancing human-like multimodal reasoning: a new challenging dataset and comprehensive framework

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework

Augmented Behavioral Annotation Tools, with Application to Multimodal Datasets and Models: A Systematic Review

Combating Missing Modalities in Egocentric Videos at Test Time

Deep Multimodal Learning with Missing Modality: A Survey

WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning

Vision+X: A Survey on Multimodal Learning in the Light of Data