Abstract: An important problem with many current visio-linguistic models is that they often depend on spurious correlations. A typical example of a spurious correlation between two variables is one that arises because a third variable (a “confounder”) causes both. Recent work has addressed this by adjusting for spurious correlations through deconfounding with automatically found confounders. We refer to this technique as AutoDeconfounding. This article examines AutoDeconfounding more closely and surfaces a number of issues with the original technique. First, we evaluate whether its implementation is actually equivalent to deconfounding. We make explicit the relation between AutoDeconfounding and the underlying causal model on which it implicitly operates, and show that additional assumptions are needed before the implementation of AutoDeconfounding can be equated to correct deconfounding. Inspired by this result, we perform ablation studies to verify to what extent the improvement on downstream visio-linguistic tasks reported by the works that implement AutoDeconfounding is due to AutoDeconfounding, and to what extent it is specifically due to the deconfounding aspect of AutoDeconfounding. When we evaluate AutoDeconfounding in a way that isolates its effect, we no longer see the same improvement. We also show that tweaking AutoDeconfounding to be less related to deconfounding does not negatively affect performance on downstream visio-linguistic tasks. Furthermore, we create a human-labeled ground-truth causality dataset for objects in a scene to empirically verify whether and how well confounders are found. We show that some models do indeed find more confounders than a random baseline, but also that finding more confounders is not correlated with better performance on downstream visio-linguistic tasks.
Finally, we summarize the current limitations of AutoDeconfounding in addressing spurious correlations and provide directions for the design of novel AutoDeconfounding methods aimed at overcoming these limitations.
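
To make the notion of a confounder concrete, the following toy simulation may help. It is only an illustrative sketch and is not the AutoDeconfounding method discussed above: the binary variables `z`, `x`, `y`, the probabilities, and the function names are all assumptions chosen for this example. A confounder `z` influences both `x` and `y`, while `x` has no effect on `y`. The naive comparison of `y` across values of `x` shows a strong association, whereas averaging the within-stratum differences over the distribution of `z` (a standard back-door style adjustment) brings the estimate close to zero.

```python
import random


def simulate(n=200_000, seed=0):
    """Draw (z, x, y) triples where z causes both x and y, and x does not cause y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                    # confounder (assumed P(z=1) = 0.5)
        x = rng.random() < (0.9 if z else 0.1)    # x depends only on z
        y = rng.random() < (0.8 if z else 0.2)    # y depends only on z, not on x
        rows.append((int(z), int(x), int(y)))
    return rows


def p_y_given(rows, x, z=None):
    """Empirical P(y=1 | x) or, if z is given, P(y=1 | x, z)."""
    selected = [r[2] for r in rows if r[1] == x and (z is None or r[0] == z)]
    return sum(selected) / len(selected)


def naive_effect(rows):
    """Observational difference P(y=1 | x=1) - P(y=1 | x=0), which ignores z."""
    return p_y_given(rows, 1) - p_y_given(rows, 0)


def adjusted_effect(rows):
    """Weight the within-stratum differences by the empirical P(z)."""
    n = len(rows)
    effect = 0.0
    for z in (0, 1):
        p_z = sum(1 for r in rows if r[0] == z) / n
        effect += p_z * (p_y_given(rows, 1, z) - p_y_given(rows, 0, z))
    return effect


rows = simulate()
naive = naive_effect(rows)
adjusted = adjusted_effect(rows)
print(f"naive difference:    {naive:.3f}")
print(f"adjusted difference: {adjusted:.3f}")
```

In this toy setting the confounder is observed and known, so the adjustment is straightforward. The setting studied in the article is harder, because the confounders have to be found automatically.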

Critical Analysis of Deconfounded Pretraining to Improve Visio-Linguistic Models
