Abstract: Massive numbers of new images are uploaded to the internet every day, yet existing cross-modal retrieval (CMR) approaches struggle to accommodate this continuously growing data. The prevalent practice is to periodically retrain or fine-tune a new model on the accumulated data, which invalidates the billions of indexed features extracted by the previous model and incurs a substantial additional computational cost to re-extract features for the entire data archive. Is it possible to develop a retrieval model that effectively captures the knowledge of upcoming sessions while preserving the discriminative power of features extracted in previous sessions? In this paper, we propose an online continual learning setup, OC-CMR, to formalize the data-incremental growth challenge faced by cross-modal retrieval systems. It consists of two key settings: 1) as in real-world scenarios, the streaming multi-modal data arrives once per session; 2) to account for computational cost, each instance of archived data has its feature extracted only once, by the model of the session in which it arrives. Under OC-CMR, we perform in-depth evaluations of state-of-the-art cross-modal retrieval methods and observe that they suffer from representational shift and collapse due to catastrophic forgetting. To address this issue, we propose the Continual Cross-Modal Retrieval (C2MR) approach, which learns a common space shared not only across modalities but also across sessions, and maintains relationships between samples from distinct sessions via cross-modal relational coherence and semantic representation coordination. We construct two new benchmarks by adapting the MS-COCO and Flickr30K datasets to the OC-CMR setting, providing a more challenging evaluation framework for CMR tasks. Experimental results demonstrate that our method effectively alleviates forgetting and significantly outperforms combinations of prior art in cross-modal retrieval and continual learning.
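
The two OC-CMR settings can be illustrated with a small sketch. This is not the C2MR method or the authors' code: `make_encoder`, `FeatureArchive`, and the random linear "models" are hypothetical stand-ins, and encoding queries with the latest session's model is an assumption made for illustration. The sketch only shows the protocol constraint that each archived item is encoded once, by the model of the session in which it arrived, and is never re-extracted afterwards.

```python
import math
import random


def make_encoder(seed, dim_in=4, dim_out=3):
    """Hypothetical stand-in for a session's trained model: a fixed random linear map."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_out)]

    def encode(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return encode


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


class FeatureArchive:
    """Gallery index in which every item is encoded exactly once."""

    def __init__(self):
        self.features = {}  # item_id -> (session, feature vector)

    def add_session(self, session, encoder, items):
        """Setting 1: data arrives once per session.
        Setting 2: each item is encoded once, by that session's model."""
        for item_id, raw in items.items():
            if item_id in self.features:
                raise ValueError(f"{item_id} is already indexed; re-extraction is not allowed")
            self.features[item_id] = (session, encoder(raw))

    def search(self, query_feature, k=1):
        """Rank all archived items, old and new, against one query feature."""
        ranked = sorted(
            self.features,
            key=lambda item_id: cosine(query_feature, self.features[item_id][1]),
            reverse=True,
        )
        return ranked[:k]


if __name__ == "__main__":
    archive = FeatureArchive()
    model_1 = make_encoder(seed=1)  # model available in session 1
    model_2 = make_encoder(seed=2)  # updated model in session 2 (assumed, not trained here)

    archive.add_session(1, model_1, {"img_a": [1.0, 0.0, 0.0, 0.0]})
    archive.add_session(2, model_2, {"img_b": [0.0, 1.0, 0.0, 0.0]})

    # The query is encoded by the latest model but compared against features
    # produced by both models, which is where representational shift can hurt.
    query = model_2([1.0, 0.0, 0.0, 0.0])
    print(archive.search(query, k=2))
```

Because the session-1 features stay frozen, retrieval quality depends on how compatible the session-2 embedding space is with the earlier one, which is the problem the abstract describes.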

Coordinated and Specific Restricted Boltzmann Machine for Cross-Modal Retrieval

Deep Correspondence Restricted Boltzmann Machine for Cross-Modal Retrieval

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Dual Modality Collaborative Learning for Cross-Source Remote Sensing Retrieval

Cross-Modal Learning With Images, Texts And Their Semantics

Bi-CMR: Bidirectional Reinforcement Guided Hashing for Effective Cross-Modal Retrieval

Adversarial Cross-Modal Retrieval

Modality-dependent Cross-media Retrieval

C2MR: Continual Cross-Modal Retrieval for Streaming Multi-modal Data

Scalable Deep Multimodal Learning for Cross-Modal Retrieval

Cross‐modal Semantic Correlation Learning by Bi‐CNN Network

Soft Contrastive Cross-Modal Retrieval

Cross-Modal Retrieval Using Multiordered Discriminative Structured Subspace Learning

Modality-Specific Cross-Modal Similarity Measurement With Recurrent Attention Network

A Comprehensive Survey on Cross-modal Retrieval

Domain Separation Network for Cross-Modal Retrieval

Continuum Regression for Cross-Modal Multimedia Retrieval

Cross-modal Retrieval Via Memory Network

Cross-Modal Retrieval by Class Information and Listwise Ranking

Cross-Modal Retrieval With Label Completion

Cross-modal Retrieval with Dual Optimization