Perceptual Score: What Data Modalities Does Your Model Perceive?

Itai Gat, Idan Schwartz, Alexander Schwing
DOI: https://doi.org/10.48550/arXiv.2110.14375
2021-10-27
Abstract: Machine learning advances in the last decade have relied significantly on large-scale datasets that continue to grow in size. Increasingly, those datasets also contain different data modalities. However, large multi-modal datasets are hard to annotate, and annotations may contain biases that we are often unaware of. Deep-net-based classifiers, in turn, are prone to exploit those biases and to find shortcuts. To study and quantify this concern, we introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features, i.e., modalities. Using the perceptual score, we find a surprisingly consistent trend across four popular datasets: recent, more accurate state-of-the-art multi-modal models for visual question-answering or visual dialog tend to perceive the visual data less than their predecessors. This trend is concerning as answers are hence increasingly inferred from textual cues only. Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions. We hope to spur a discussion on the perceptiveness of multi-modal models and also hope to encourage the community working on multi-modal classifiers to start quantifying perceptiveness via the proposed perceptual score.
Machine Learning, Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
The problem this paper addresses is **evaluating and quantifying how strongly multimodal models depend on each data modality, and in particular whether they over-rely on some modalities (such as text) while ignoring others (such as images)**. Specifically, the authors point out that recent multimodal models for visual question answering (VQA) and visual dialog have become more accurate while perceiving the visual data less: they rely increasingly on textual cues to infer answers. This trend is concerning because it suggests the models may not truly understand or use the image information.

To study this, the authors introduce a new metric, the **perceptual score**, which measures the degree to which a model depends on each data modality. With it, they hope to reveal model biases and to encourage the research community to pay more attention to how well models perceive the different modalities.

### Definition of Perceptual Score

The perceptual score \( P_{f,D}(M_m) \) of a model \( f \) on a dataset \( D \) with respect to a modality \( M_m \) is defined as:

\[ P_{f,D}(M_m) = \frac{1}{Z} \, \mathbb{E}_{(x,y) \sim D} \left[ P_{f,x,y}(M_m) \right] \]

where:

- \( Z \) is a normalization factor that makes scores comparable.
- \( P_{f,x,y}(M_m) \) is the sample perceptual score, defined as the difference between the accuracy when the model uses all modalities and the accuracy when it does not use the modality \( M_m \):

\[ P_{f,x,y}(M_m) = \text{Acc}_{f,x,y}(M) - \text{Acc}_{f,x,y}(M \setminus \{M_m\}) \]

Here, \( \text{Acc}_{f,x,y}(M) \) is the prediction accuracy of the model on a given sample \( (x, y) \) when it uses all modalities \( M \), and \( \text{Acc}_{f,x,y}(M \setminus \{M_m\}) \) is the prediction accuracy on the same sample when the modality \( M_m \) is not used.
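The definition above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in this summary: "not using" a modality is approximated by shuffling that modality across samples (so the model sees a mismatched input), the normalization factor \( Z \) is passed in as a plain number `z`, and the names `accuracy`, `perceptual_score`, `predict`, and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np


def accuracy(predict, features, labels):
    """Fraction of samples whose prediction equals the label."""
    return float(np.mean(predict(features) == labels))


def perceptual_score(predict, features, labels, modality,
                     n_permutations=20, z=1.0, seed=0):
    """Estimate how much `predict` relies on `modality`.

    features: dict mapping a modality name to an array whose first
              axis indexes samples.
    z:        normalization factor (assumed to be a given constant).
    Removal of a modality is approximated by permuting it across
    samples; this is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    n = len(labels)
    full_acc = accuracy(predict, features, labels)

    removed_accs = []
    for _ in range(n_permutations):
        shuffled = dict(features)  # shallow copy; other modalities untouched
        shuffled[modality] = features[modality][rng.permutation(n)]
        removed_accs.append(accuracy(predict, shuffled, labels))

    # Accuracy with all modalities minus mean accuracy without `modality`.
    return (full_acc - float(np.mean(removed_accs))) / z


# Toy example: a "model" that reads only the text and ignores the image.
labels = np.arange(200) % 2
features = {
    "text": labels.copy(),  # text fully determines the answer
    "image": np.random.default_rng(1).normal(size=(200, 4)),
}
text_only_model = lambda f: f["text"]

image_score = perceptual_score(text_only_model, features, labels, "image")
text_score = perceptual_score(text_only_model, features, labels, "text")
```

In this toy setup the image score is zero because shuffling the images never changes a prediction, while the text score is clearly positive. That is the pattern the summary describes for models that answer from textual cues alone.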
### Experimental Results

Through experiments on multiple multimodal datasets (such as VQA, VQA-CP, and VisDial), the authors found that:

1. **More recent, more accurate multimodal models perceive the visual data less in visual question-answering tasks** and rely more on textual cues.
2. **The perceptual score can help analyze a model's biases**. For example, on the VQA-CP dataset, the state-of-the-art model CSS relies almost entirely on the question itself for "yes/no" questions and ignores the image information.
3. **The perceptual score can also reveal potential biases in the training data**. For example, the answer distribution for certain question types may differ between the training set and the test set, causing the model to learn the wrong patterns.

In conclusion, by introducing the perceptual score, this paper aims to prompt researchers to pay more attention to how strongly multimodal models depend on each data modality and to take measures that reduce over-reliance on particular modalities.