Abstract: Machine learning advances in the last decade have relied significantly on large-scale datasets that continue to grow in size. Increasingly, those datasets also contain different data modalities. However, large multi-modal datasets are hard to annotate, and annotations may contain biases that we are often unaware of. Deep-net-based classifiers, in turn, are prone to exploit those biases and to find shortcuts. To study and quantify this concern, we introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features, i.e., modalities. Using the perceptual score, we find a surprisingly consistent trend across four popular datasets: recent, more accurate state-of-the-art multi-modal models for visual question-answering or visual dialog tend to perceive the visual data less than their predecessors. This trend is concerning as answers are hence increasingly inferred from textual cues only. Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions. We hope to spur a discussion on the perceptiveness of multi-modal models and also hope to encourage the community working on multi-modal classifiers to start quantifying perceptiveness via the proposed perceptual score.

What problem does this paper attempt to address?

The problem this paper addresses is **evaluating and quantifying how much a multimodal model depends on each of its input modalities, and in particular whether it over-relies on some modalities (such as text) while ignoring others (such as images)**. Specifically, the authors point out that recent multimodal models for visual question answering (VQA) and visual dialog have become more accurate while perceiving the visual data less: they rely increasingly on textual cues to infer answers. This trend is concerning because it suggests that the models may not actually understand or use the image information.

To study this, the authors introduce a new metric, the **perceptual score**, which measures how strongly a model depends on each data modality. With it, they hope to reveal model biases and to encourage the research community to pay more attention to how well models perceive the different modalities.

### Definition of Perceptual Score

The perceptual score \( P_{f,D}(M_m) \) of a model \( f \) on a dataset \( D \) for a modality \( M_m \) is defined as:

\[ P_{f,D}(M_m) = \frac{1}{Z} \, E_{(x,y) \sim D} \left[ P_{f,x,y}(M_m) \right] \]

where:

- \( Z \) is a normalization factor that makes the scores comparable.
- \( P_{f,x,y}(M_m) \) is the sample perceptual score, defined as the difference between the model's accuracy when it uses all modalities and its accuracy when it does not use the modality \( M_m \):

\[ P_{f,x,y}(M_m) = \text{Acc}_{f,x,y}(M) - \text{Acc}_{f,x,y}(M \setminus \{M_m\}) \]

Here, \( \text{Acc}_{f,x,y}(M) \) is the prediction accuracy of the model on a given sample \( (x, y) \) when it uses all modalities \( M \), and \( \text{Acc}_{f,x,y}(M \setminus \{M_m\}) \) is the prediction accuracy on the same sample when the modality \( M_m \) is not used.
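The definition above can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in this summary: "not using" a modality is approximated by shuffling that modality across samples, the model is a hypothetical callable `predict` that takes a dict of per-modality arrays, and the normalization factor `z` is supplied by the caller (defaulting to 1, i.e. no normalization). The names `perceptual_score`, `n_perm`, and the toy data are illustrative only.

```python
import numpy as np

def perceptual_score(predict, inputs, labels, modality, z=1.0, n_perm=10, seed=0):
    """Estimate the perceptual score of `modality` for a model.

    predict  : hypothetical callable, dict of arrays -> array of predicted labels
    inputs   : dict mapping modality name -> array whose first axis indexes samples
    z        : normalization factor Z (assumed to be provided by the caller)
    n_perm   : number of random shuffles used to approximate modality removal
    Returns (dataset-level score, per-sample scores).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    # Per-sample accuracy with all modalities intact (1.0 if correct, else 0.0).
    full_correct = (np.asarray(predict(inputs)) == labels).astype(float)

    # Per-sample accuracy with the chosen modality shuffled across samples,
    # averaged over several shuffles (assumed stand-in for "modality not used").
    ablated_correct = np.zeros_like(full_correct)
    for _ in range(n_perm):
        perm = rng.permutation(len(labels))
        shuffled = dict(inputs)
        shuffled[modality] = np.asarray(inputs[modality])[perm]
        ablated_correct += (np.asarray(predict(shuffled)) == labels).astype(float)
    ablated_correct /= n_perm

    # Sample perceptual score: Acc(all modalities) - Acc(without the modality).
    per_sample = full_correct - ablated_correct
    return per_sample.mean() / z, per_sample

# Toy data: the answer is fully determined by the "text" modality.
text = np.array([0, 1, 2, 3, 4, 5, 6, 7])
image = np.array([7, 6, 5, 4, 3, 2, 1, 0])
data = {"text": text, "image": image}

# Toy model that ignores the image entirely.
text_only_model = lambda x: x["text"]

score, _ = perceptual_score(text_only_model, data, labels=text, modality="image")
print(score)  # 0.0: shuffling the image never changes this model's predictions
```

A score near zero for a modality indicates that the model's accuracy barely changes when that modality is disrupted, which is the behavior the summary describes for the visual input of recent VQA models.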
### Experimental Results

Through experiments on multiple multimodal datasets (such as VQA, VQA-CP, and VisDial), the authors found that:

1. **More recent, more accurate multimodal models perceive the visual data less in visual question-answering tasks** and rely more on textual cues.
2. **The perceptual score can help analyze a model's biases**. For example, on the VQA-CP dataset, the state-of-the-art model CSS relies almost entirely on the question itself for "yes/no" questions and ignores the image information.
3. **The perceptual score can also reveal potential biases in the training data**. For example, the distribution of answers for certain question types may differ between the training set and the test set, causing the model to learn the wrong patterns.

In conclusion, by introducing the perceptual score, this paper aims to prompt researchers to pay more attention to how much multimodal models depend on each data modality and to take measures that reduce over-reliance on particular modalities.
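The per-subset analysis mentioned in the results above (for instance, by question type) can be sketched as follows. This is one natural way to decompose an average of per-sample scores, not necessarily the exact formulation used in the paper; the function name `subset_contributions`, the group labels, and the score values are all hypothetical.

```python
from collections import defaultdict

def subset_contributions(per_sample_scores, groups):
    """Split a mean per-sample score into additive contributions per subset.

    Each subset's contribution is the sum of its per-sample scores divided by
    the total number of samples, so the contributions add up to the overall mean.
    """
    n = len(per_sample_scores)
    totals = defaultdict(float)
    for score, group in zip(per_sample_scores, groups):
        totals[group] += score
    return {group: total / n for group, total in totals.items()}

# Made-up per-sample scores, grouped by a hypothetical question type.
scores = [0.0, 0.0, 1.0, 0.5]
question_types = ["yes/no", "yes/no", "number", "other"]

contributions = subset_contributions(scores, question_types)
print(contributions)
```

In this toy example the "yes/no" subset contributes nothing to the overall score, which is the kind of pattern that would flag a question type as being answered without the modality under study.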

Perceptual Score: What Data Modalities Does Your Model Perceive?
