Towards Deconfounded Image-Text Matching with Causal Inference

Wenhui Li, Xinqi Su, Dan Song, Lanjun Wang, Kun Zhang, An-An Liu
DOI: https://doi.org/10.1145/3581783.3612472
2024-08-22
Abstract: Prior image-text matching methods have shown remarkable performance on many benchmark datasets, but most of them overlook dataset bias, which exists both within and across modalities, and tend to learn spurious correlations that severely degrade the model's generalization ability. Furthermore, these methods often incorporate biased external knowledge from large-scale datasets as prior knowledge into the image-text matching model, which inevitably forces the model to learn further biased associations. To address these limitations, this paper first uses Structural Causal Models (SCMs) to illustrate how intra- and inter-modal confounders damage image-text matching. We then employ backdoor adjustment to propose an innovative Deconfounded Causal Inference Network (DCIN) for the image-text matching task. DCIN (1) decomposes the intra- and inter-modal confounders and incorporates them into the encoding stage of visual and textual features, effectively eliminating spurious correlations during image-text matching, and (2) uses causal inference to mitigate biases in external knowledge. Consequently, the model can learn causal relationships instead of spurious correlations caused by dataset bias. Extensive experiments on two well-known benchmark datasets, Flickr30K and MSCOCO, demonstrate the superiority of the proposed method.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

The paper aims to address the bias issue in Image-Text Matching (ITM). Specifically, existing ITM methods perform well on many benchmark datasets, but they often overlook intra-modal and inter-modal biases present in the datasets. This leads to models learning spurious correlations, severely impairing their generalization ability. Additionally, these methods typically introduce biased external knowledge from large-scale datasets as prior knowledge, further reinforcing this bias.

To tackle these limitations, the authors propose using Structural Causal Models (SCMs) to illustrate how intra-modal and inter-modal confounding factors harm image-text matching performance. They employ Backdoor Adjustment to propose an innovative Deconfounded Causal Inference Network (DCIN). DCIN achieves its goal through the following two steps:

1. **Decomposing Confounding Factors**: Decomposing intra-modal and inter-modal confounding factors in the image-to-text (or text-to-image) tasks within the training set and using them during the visual and textual feature encoding stages. This effectively eliminates spurious correlations brought by the training set, forcing the model to learn causal relationships rather than common co-occurrences.
2. **Mitigating External Knowledge Bias**: Using causal probability estimation to mitigate bias when introducing external knowledge.

Experimental results show that the proposed DCIN method outperforms existing methods on two well-known benchmark datasets, Flickr30K and MSCOCO.
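
The backdoor adjustment mentioned above can be illustrated on a toy problem. In its general form, it replaces the observational quantity P(Y | X) with the interventional one, P(Y | do(X)) = Σ_z P(Y | X, z) P(z), so that the confounder Z is weighted by its overall prior rather than by how often it co-occurs with X. The sketch below is not DCIN's implementation. It is a minimal numeric example under assumed, hypothetical probabilities, with binary variables `X` (a cue in the input), `Y` (a match decision), and `Z` (a confounder such as a frequent co-occurrence in the dataset).

```python
# Toy backdoor adjustment on a hand-specified distribution.
# All numbers and names below are hypothetical, chosen only for illustration.

# Prior over the binary confounder Z.
P_Z = {0: 0.5, 1: 0.5}

# P(X = x | Z = z), indexed as P_X_GIVEN_Z[z][x]; Z strongly influences X.
P_X_GIVEN_Z = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.1, 1: 0.9},
}

# P(Y = 1 | X = x, Z = z), indexed by (x, z); Z also strongly influences Y.
P_Y1_GIVEN_XZ = {
    (0, 0): 0.1,
    (1, 0): 0.2,
    (0, 1): 0.7,
    (1, 1): 0.8,
}


def observational(x):
    """P(Y=1 | X=x): weights each z by P(z | x), so the bias from Z leaks in."""
    p_x = sum(P_X_GIVEN_Z[z][x] * P_Z[z] for z in P_Z)
    return sum(
        P_Y1_GIVEN_XZ[(x, z)] * P_X_GIVEN_Z[z][x] * P_Z[z] / p_x
        for z in P_Z
    )


def backdoor_adjusted(x):
    """P(Y=1 | do(X=x)): weights each z by its prior P(z) instead."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


if __name__ == "__main__":
    naive_effect = observational(1) - observational(0)
    causal_effect = backdoor_adjusted(1) - backdoor_adjusted(0)
    print(f"observational difference: {naive_effect:.2f}")
    print(f"backdoor-adjusted difference: {causal_effect:.2f}")
```

With these assumed numbers, the observational difference is about 0.58 while the adjusted difference is about 0.10: most of the apparent association between `X` and `Y` comes from the confounder, which is the kind of spurious correlation the summary describes.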