Abstract: The wide adoption of self-attention (i.e., the transformer model) and BERT-like training principles has recently resulted in a number of high-performing models on a wide range of vision-and-language problems (such as Visual Question Answering (VQA), image retrieval, etc.). In this paper, we claim that these State-Of-The-Art (SOTA) approaches perform reasonably well in structuring information inside a single modality but, despite their impressive performance, tend to struggle to identify fine-grained inter-modality relationships. Indeed, such relations are frequently assumed to be implicitly learned during training from application-specific losses, mostly cross-entropy for classification. While most recent works provide an inductive bias for inter-modality relationships via cross-attention modules, in this work we demonstrate (1) that the latter assumption does not hold, i.e., modality alignment does not necessarily emerge automatically, and (2) that adding weak supervision for alignment between visual objects and words improves the quality of the learned models on tasks requiring reasoning. In particular, we integrate an object-word alignment loss into SOTA vision-language reasoning models and evaluate it on two tasks: VQA and Language-driven Comparison of Images. We show that the proposed fine-grained inter-modality supervision significantly improves performance on both tasks. In particular, this new learning signal allows obtaining SOTA-level performance on the GQA dataset (VQA task) with pre-trained models without finetuning on the task, and a new SOTA on the NLVR2 dataset (Language-driven Comparison of Images). Finally, we also illustrate the impact of the contribution on the models' reasoning by visualizing attention distributions.

Revisiting Weakly Supervised Pre-Training of Visual Perception Models

Exploring the Limits of Weakly Supervised Pretraining

Self-supervised Pretraining of Visual Features in the Wild

Boost Supervised Pretraining for Visual Transfer Learning: Implications of Self-Supervised Contrastive Representation Learning

An Analysis of Unsupervised Pre-training in Light of Recent Advances

Colorization as a Proxy Task for Visual Understanding

Learning Transferable Visual Models From Natural Language Supervision

A Closer Look at Self-Supervised Lightweight Vision Transformers

Rethinking Pre-training and Self-training

Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

Scaling and Benchmarking Self-Supervised Visual Representation Learning

Weakly Supervised Training of Universal Visual Concepts for Multi-domain Semantic Segmentation

Enhancing Vision-Language Pre-training with Rich Supervisions

Pre-Trained Vision-Language Models as Partial Annotators

Contrastive-Adversarial and Diffusion: Exploring pre-training and fine-tuning strategies for sulcal identification

Continual Pre-Training Mitigates Forgetting in Language and Vision

When Does Contrastive Visual Representation Learning Work?

Revisiting Sparse Convolutional Model for Visual Recognition

Investigating Self-Supervised Methods for Label-Efficient Learning

CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images