Towards Deconfounded Image-Text Matching with Causal Inference

Wenhui Li, Xinqi Su, Dan Song, Lanjun Wang, Kun Zhang, An-An Liu

DOI: https://doi.org/10.1145/3581783.3612472

2024-08-22

Abstract: Prior image-text matching methods have shown remarkable performance on many benchmark datasets, but most of them overlook dataset bias, which exists both within and across modalities, and tend to learn spurious correlations that severely degrade the model's generalization ability. Furthermore, these methods often incorporate biased external knowledge from large-scale datasets as prior knowledge into the image-text matching model, which inevitably forces the model to learn further biased associations. To address the above limitations, this paper first uses Structural Causal Models (SCMs) to illustrate how intra- and inter-modal confounders damage image-text matching. We then employ backdoor adjustment to propose an innovative Deconfounded Causal Inference Network (DCIN) for the image-text matching task. DCIN (1) decomposes the intra- and inter-modal confounders and incorporates them into the encoding stage of visual and textual features, effectively eliminating spurious correlations during image-text matching, and (2) uses causal inference to mitigate the biases of external knowledge. Consequently, the model can learn causality instead of spurious correlations caused by dataset bias. Extensive experiments on two well-known benchmark datasets, i.e., Flickr30K and MSCOCO, demonstrate the superiority of our proposed method.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

The paper aims to address the bias issue in Image-Text Matching (ITM). Specifically, existing ITM methods perform well on many benchmark datasets, but they often overlook intra-modal and inter-modal biases present in the datasets. This leads to models learning spurious correlations, severely impairing their generalization ability. Additionally, these methods typically introduce biased external knowledge from large-scale datasets as prior knowledge, further reinforcing this bias.

To tackle these limitations, the authors propose using Structural Causal Models (SCMs) to illustrate how intra-modal and inter-modal confounding factors harm image-text matching performance. They employ Backdoor Adjustment to propose an innovative Deconfounded Causal Inference Network (DCIN). DCIN achieves its goal through the following two steps:

1. **Decomposing Confounding Factors**: Decomposing intra-modal and inter-modal confounding factors in the image-to-text (or text-to-image) tasks within the training set and using them during the visual and textual feature encoding stages. This effectively eliminates spurious correlations brought by the training set, forcing the model to learn causal relationships rather than common co-occurrences.
2. **Mitigating External Knowledge Bias**: Using causal probability estimation to mitigate bias when introducing external knowledge.

Experimental results show that the proposed DCIN method outperforms existing methods on two well-known benchmark datasets, Flickr30K and MSCOCO.
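
To make the backdoor adjustment mentioned above more concrete, here is a minimal sketch of the textbook formula P(Y | do(X)) = Σ_z P(Y | X, z) P(z) on a toy discrete example. It is not the DCIN implementation, which applies the adjustment inside the visual and textual feature encoders. The variables, their interpretations, and every probability below are hypothetical values chosen purely for illustration.

```python
# Toy illustration of backdoor adjustment. All numbers are made up.
# Z: binary confounder (e.g. a scene context that is frequent in the training set)
# X: binary cause      (e.g. a visual cue is present)
# Y: binary outcome    (e.g. the image-text pair is judged a match)

P_Z = {0: 0.5, 1: 0.5}            # P(Z=z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}   # P(X=1 | Z=z): Z influences X
P_Y1_GIVEN_XZ = {                 # P(Y=1 | X=x, Z=z): Z also influences Y
    (0, 0): 0.1, (1, 0): 0.3,
    (0, 1): 0.5, (1, 1): 0.7,
}


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary x."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def observational(x):
    """P(Y=1 | X=x): each z is weighted by P(z | x), so the confounder leaks in."""
    joint = {z: p_x_given_z(x, z) * P_Z[z] for z in P_Z}
    total = sum(joint.values())
    return sum(P_Y1_GIVEN_XZ[(x, z)] * joint[z] / total for z in P_Z)


def interventional(x):
    """P(Y=1 | do(X=x)) via backdoor adjustment: each z is weighted by P(z)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


# Apparent effect of X on Y with and without adjusting for Z.
obs_gap = observational(1) - observational(0)
do_gap = interventional(1) - interventional(0)
print(f"observational gap: {obs_gap:.2f}, adjusted gap: {do_gap:.2f}")
```

In this toy setup the observational gap is larger than the adjusted one, because part of the apparent association between X and Y is carried by Z. That excess is the kind of spurious correlation a deconfounded model is meant to ignore.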

