CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks

Tejas Srinivasan, Ting-Yun Chang, Leticia Leonor Pinto Alva, Georgios Chochlakis, Mohammad Rostami, Jesse Thomason

DOI: https://doi.org/10.48550/arXiv.2206.09059

2022-11-25

Abstract: Current state-of-the-art vision-and-language models are evaluated on tasks either individually or in a multi-task setting, overlooking the challenges of continually learning (CL) tasks as they arrive. Existing CL benchmarks have facilitated research on task adaptation and mitigating "catastrophic forgetting", but are limited to vision-only and language-only tasks. We present CLiMB, a benchmark to study the challenge of learning multimodal tasks in a CL setting, and to systematically evaluate how upstream continual learning can rapidly generalize to new multimodal and unimodal tasks. CLiMB includes implementations of several CL algorithms and a modified Vision-Language Transformer (ViLT) model that can be deployed on both multimodal and unimodal tasks. We find that common CL methods can help mitigate forgetting during multimodal task learning, but do not enable cross-task knowledge transfer. We envision that CLiMB will facilitate research on a new class of CL algorithms for this challenging multimodal setting.

Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

The paper addresses the difficulty that current vision-and-language models face in the continual learning (CL) setting. Such models are usually evaluated on individual tasks or in a multi-task setting, which ignores the question of how to keep learning as new tasks arrive. Existing CL benchmarks, meanwhile, focus on single-modality tasks (vision-only or language-only) and do not cover multimodal tasks. The paper therefore proposes CLiMB (Continual Learning in Multimodality Benchmark), a benchmark for studying the challenge of learning multimodal tasks in a CL setting and for systematically evaluating how well upstream continual learning supports rapid generalization to new multimodal and unimodal tasks. CLiMB includes implementations of several CL algorithms and a modified Vision-Language Transformer (ViLT) model that can be deployed on both multimodal and unimodal tasks. In their experiments, the authors find that common CL methods can help mitigate forgetting during multimodal task learning but do not enable cross-task knowledge transfer, which points to the need for new CL strategies for vision-and-language tasks. They also find that current CL algorithms and multimodal models are not well suited to low-shot adaptation to multimodal or unimodal tasks. The authors hope that CLiMB will provide a basis for developing models and learning algorithms for multimodal continual learning.
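To make the two quantities discussed above concrete, the following is a minimal sketch of how forgetting and cross-task transfer are often measured in the CL literature. It is not taken from CLiMB and may differ from the benchmark's exact definitions; the function names `forgetting` and `transfer` and the score-matrix layout are assumptions made for illustration only.

```python
from typing import List


def forgetting(scores: List[List[float]]) -> List[float]:
    """Per-task forgetting after the final task has been learned.

    scores[i][j] is the (hypothetical) evaluation score on task j measured
    right after training on task i, with tasks learned in order 0..n-1.
    Only entries with j <= i are used.
    """
    n = len(scores)
    result = []
    for j in range(n - 1):
        # Best score task j ever reached before the final training stage.
        best_earlier = max(scores[i][j] for i in range(j, n - 1))
        # Drop from that best score to the score after the last task.
        result.append(best_earlier - scores[n - 1][j])
    return result


def transfer(score_after_upstream_cl: float, score_direct: float) -> float:
    """Positive when upstream continual learning helped the new task,
    negative when training directly on the task would have been better."""
    return score_after_upstream_cl - score_direct
```

For example, with three tasks learned in sequence, `forgetting([[0.8, 0.0, 0.0], [0.6, 0.7, 0.0], [0.5, 0.65, 0.9]])` reports a drop of about 0.30 on the first task and about 0.05 on the second. A CL method that mitigates forgetting should shrink these drops; a method that enables knowledge transfer should also make `transfer` positive on later tasks.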
