Abstract: Image-text retrieval is a fundamental task in bridging the semantics of vision and language. The key challenge lies in learning the semantic alignment between two heterogeneous modalities both accurately and efficiently. Existing image-text retrieval approaches can be roughly classified into two paradigms. The first, the independent-embedding paradigm, learns global embeddings of the two modalities; it achieves efficient retrieval but fails to capture the fine-grained cross-modal interaction between images and texts. The second, the interactive-embedding paradigm, learns fine-grained alignment between regions and words; it achieves accurate retrieval but sacrifices retrieval efficiency. In this paper, we propose a novel Independent Memory-Enhanced emBedding learning framework (IMEB), which introduces a lightweight middleware, i.e., a memory network, into independent-embedding approaches to exploit the complementary strengths of both paradigms simultaneously. Specifically, in the training stage, we first propose a novel cross-modal association graph to learn fine-grained cross-modal interaction information. Then, we carefully design a memory-assisted embedding learning network that stores the prototypical features obtained after interaction as agents, and we update the memory network effectively via two learning strategies. Finally, in the inference stage, we interact directly with these agent-level prototypical features from the memory bank, thus efficiently obtaining cross-modal memory-enhanced embeddings. In this way, our model not only learns cross-modal interaction information effectively but also maintains retrieval efficiency. Extensive experimental results on two benchmarks, i.e., Flickr30K and MS-COCO, demonstrate that our IMEB performs favorably against state-of-the-art methods.
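The inference-stage idea described in the abstract, enhancing an independently computed embedding by interacting with a small bank of stored prototypes, can be illustrated with a minimal sketch. The abstract does not specify how the memory is read, so everything below is an assumption made for illustration: the softmax attention read-out, the residual addition, and the names `memory_enhance` and `rank_texts` are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_enhance(embedding, memory_bank):
    """Assumed read-out: attend over the memory slots, then add the result.

    embedding   -- global embedding of one image or one text (length d)
    memory_bank -- K prototype vectors of length d, fixed at inference time
    """
    d = len(embedding)
    scores = [dot(embedding, slot) / math.sqrt(d) for slot in memory_bank]
    weights = softmax(scores)
    readout = [
        sum(w * slot[i] for w, slot in zip(weights, memory_bank))
        for i in range(d)
    ]
    # Residual combination keeps the original global embedding in the result.
    return l2_normalize([e + r for e, r in zip(embedding, readout)])


def rank_texts(image_emb, text_embs, memory_bank):
    """Rank candidate texts for one image by cosine similarity.

    Each side touches only the shared memory bank, never the other modality,
    so text embeddings could still be precomputed offline.
    """
    img = memory_enhance(image_emb, memory_bank)
    sims = [dot(img, memory_enhance(t, memory_bank)) for t in text_embs]
    return sorted(range(len(text_embs)), key=lambda i: sims[i], reverse=True)
```

Under these assumptions, the cost of enhancing one embedding grows with the number of memory slots rather than with the number of region-word pairs, which is one way to read the efficiency argument in the abstract.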

Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval
