CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks

Tejas Srinivasan, Ting-Yun Chang, Leticia Leonor Pinto Alva, Georgios Chochlakis, Mohammad Rostami, Jesse Thomason
DOI: https://doi.org/10.48550/arXiv.2206.09059
2022-11-25
Abstract: Current state-of-the-art vision-and-language models are evaluated on tasks either individually or in a multi-task setting, overlooking the challenges of continually learning (CL) tasks as they arrive. Existing CL benchmarks have facilitated research on task adaptation and mitigating "catastrophic forgetting", but are limited to vision-only and language-only tasks. We present CLiMB, a benchmark to study the challenge of learning multimodal tasks in a CL setting, and to systematically evaluate how upstream continual learning can rapidly generalize to new multimodal and unimodal tasks. CLiMB includes implementations of several CL algorithms and a modified Vision-Language Transformer (ViLT) model that can be deployed on both multimodal and unimodal tasks. We find that common CL methods can help mitigate forgetting during multimodal task learning, but do not enable cross-task knowledge transfer. We envision that CLiMB will facilitate research on a new class of CL algorithms for this challenging multimodal setting.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the lack of continual learning (CL) evaluation for vision-and-language models. Existing vision-and-language models are usually evaluated on individual tasks or in a multi-task setting, which ignores the challenge of learning tasks sequentially as they arrive. Moreover, existing CL benchmarks focus on single-modality tasks (vision-only or language-only) and do not cover multimodal tasks. The paper therefore proposes CLiMB (Continual Learning in Multimodality Benchmark), a benchmark for studying the challenges of learning multimodal tasks in a CL setting and for systematically evaluating how upstream continual learning can rapidly generalize to new multimodal and unimodal tasks. CLiMB includes implementations of several CL algorithms, as well as a modified Vision-Language Transformer (ViLT) model that can be deployed on both multimodal and unimodal tasks. In their experiments, the authors find that common CL methods help mitigate forgetting during multimodal task learning but do not enable knowledge transfer across tasks, which points to the need for new CL strategies for vision-and-language tasks. In addition, current CL algorithms and multimodal models are not well suited to low-shot adaptation to multimodal or unimodal tasks. The authors hope that CLiMB will provide a basis for developing models and learning algorithms for multimodal continual learning.