Abstract: Massive numbers of new images are uploaded to the internet every day, yet existing cross-modal retrieval (CMR) approaches struggle to accommodate this continuously growing data. The prevalent practice is to periodically retrain or fine-tune a new model on the accumulated data, which invalidates billions of indexed features extracted by the previous model and incurs substantial additional computation to re-extract features for the entire data archive. Is it possible to develop a retrieval model that effectively captures the knowledge of upcoming sessions while preserving the discriminative power of features extracted in previous sessions? In this paper, we propose an online continual learning setup, OC-CMR, to formalize the data-incremental growth challenge faced by cross-modal retrieval systems. It consists of two key settings: 1) as in real-world scenarios, the streaming multi-modal data arrives once per session; 2) to account for computational cost, each instance of archived data has its feature extracted only once, by the model of its own session. Under OC-CMR, we perform in-depth evaluations of state-of-the-art cross-modal retrieval methods and observe that they suffer from representational shift and collapse due to catastrophic forgetting. To address this issue, we propose the Continual Cross-Modal Retrieval (C2MR) approach, which learns a shared common space not only across modalities but also across sessions, and maintains relationships between samples from distinct sessions via cross-modal relational coherence and semantic representation coordination. We construct two new benchmarks by adapting the MS-COCO and Flickr30K datasets to the OC-CMR setting, providing a more challenging evaluation framework for CMR tasks. Experimental results demonstrate that our method effectively alleviates forgetting and significantly outperforms combinations of prior cross-modal retrieval and continual learning methods.
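
The two OC-CMR settings can be illustrated with a small toy sketch. It is not the C2MR method or the authors' evaluation code; it only shows the constraint that archived features are extracted once, by the encoder of their own session, while later queries are embedded by the latest encoder. All names here (`FeatureArchive`, `recall_at_k`, `W_old`, `W_new`) are hypothetical, the encoders are random linear maps, and queries reuse the gallery's raw vectors as a stand-in for paired image-text data.

```python
import numpy as np


class FeatureArchive:
    """Gallery index in which every item is embedded exactly once,
    by the encoder of the session in which it arrived (hypothetical helper)."""

    def __init__(self):
        self.features = []  # one (n_i, d) array per session, never re-extracted
        self.ids = []       # item id for every archived row

    def add_session(self, item_ids, raw_items, encoder):
        feats = encoder(raw_items)
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        self.features.append(feats)
        self.ids.extend(item_ids)

    def search(self, query_feats, k=1):
        """Return the ids of the top-k archived items by cosine similarity."""
        gallery = np.concatenate(self.features, axis=0)
        q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
        sims = q @ gallery.T
        top = np.argsort(-sims, axis=1)[:, :k]
        return np.asarray(self.ids)[top]


def recall_at_k(retrieved_ids, true_ids):
    """Fraction of queries whose true item id appears in the retrieved row."""
    hits = [t in row for row, t in zip(retrieved_ids, true_ids)]
    return float(np.mean(hits))


# Toy encoders (assumption): random linear maps; the second one is a
# perturbed copy of the first, imitating a model updated in a later session.
rng = np.random.default_rng(0)
dim_in, dim_out = 16, 8
W_old = rng.normal(size=(dim_in, dim_out))
W_new = W_old + 0.5 * rng.normal(size=(dim_in, dim_out))

items_s0 = rng.normal(size=(50, dim_in))  # data arriving in session 0
items_s1 = rng.normal(size=(50, dim_in))  # data arriving in session 1

archive = FeatureArchive()
archive.add_session(list(range(50)), items_s0, lambda x: x @ W_old)
archive.add_session(list(range(50, 100)), items_s1, lambda x: x @ W_new)

# Queries targeting session-0 items, embedded by the latest encoder
# versus by the encoder that produced the archived features.
r_shift = recall_at_k(archive.search(items_s0 @ W_new, k=1), range(50))
r_same = recall_at_k(archive.search(items_s0 @ W_old, k=1), range(50))
print(f"R@1 with latest encoder: {r_shift:.2f}, with matching encoder: {r_same:.2f}")
```

In this toy setup, any gap between the two recall values comes purely from the mismatch between the new query encoder and the frozen session-0 features, which is the kind of representational shift the abstract describes.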

Knowledge Decomposition and Replay: A Novel Cross-modal Image-Text Retrieval Continual Learning Method

Continual Learning With Knowledge Distillation: A Survey

CORE: Mitigating Catastrophic Forgetting in Continual Learning through Cognitive Replay

Continual learning for cross-modal image-text retrieval based on domain-selective attention

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

C2MR: Continual Cross-Modal Retrieval for Streaming Multi-modal Data

Continual Vision-Language Retrieval Via Dynamic Knowledge Rectification

Imbalance Mitigation for Continual Learning via Knowledge Decoupling and Dual Enhanced Contrastive Learning

Relational Experience Replay: Continual Learning by Adaptively Tuning Task-wise Relationship

Continual Learning Through Retrieval and Imagination

Hierarchical Correlations Replay for Continual Learning

Continual Learning via Manifold Expansion Replay

Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning

Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting

Continual Learning: Less Forgetting, More OOD Generalization via Adaptive Contrastive Replay

Online Continual Learning with Declarative Memory

Adaptive online continual multi-view learning

Online Continual Learning Via the Meta-learning Update with Multi-scale Knowledge Distillation and Data Augmentation

Memory Replay with Data Compression for Continual Learning

Continual Referring Expression Comprehension Via Dual Modular Memorization

A Benchmark and Empirical Analysis for Replay Strategies in Continual Learning