Abstract: Image-text retrieval is a fundamental task in bridging the semantics of vision and language. The key challenge lies in learning the semantic alignment between two heterogeneous modalities both accurately and efficiently. Existing image-text retrieval approaches can be roughly classified into two paradigms. The first, the independent-embedding paradigm, learns global embeddings of the two modalities; it achieves efficient retrieval but fails to capture the fine-grained cross-modal interaction between images and texts. The second, the interactive-embedding paradigm, learns fine-grained alignment between regions and words; it achieves accurate retrieval but sacrifices retrieval efficiency. In this paper, we propose a novel Independent Memory-Enhanced emBedding learning framework (IMEB), which introduces a lightweight middleware, i.e., a memory network, into independent-embedding approaches to exploit the complementary strengths of both paradigms. Specifically, in the training stage, we first propose a novel cross-modal association graph to learn fine-grained cross-modal interaction information. We then design a memory-assisted embedding learning network that stores the prototypical features obtained after interaction as agents, and we update the memory network effectively via two learning strategies. Finally, in the inference stage, each modality interacts directly with these agent-level prototypical features from the memory bank, efficiently yielding cross-modal memory-enhanced embeddings. In this way, our model learns cross-modal interaction information while maintaining retrieval efficiency. Extensive experimental results on two benchmarks, i.e., Flickr30K and MS-COCO, demonstrate that IMEB performs favorably against state-of-the-art methods.
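
The inference-stage idea described in the abstract, in which each modality is enhanced independently by reading from a shared memory bank of prototypical features, can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product attention read, the residual addition, and the names `memory_enhance` and `rank_texts` are all assumptions made for illustration, and the memory bank is assumed to be already trained and fixed.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def memory_enhance(embedding, memory):
    """Hypothetical memory read: attend over the memory slots with
    dot-product attention and add the read-out to the embedding."""
    weights = softmax([dot(embedding, slot) for slot in memory])
    read = [
        sum(w * slot[i] for w, slot in zip(weights, memory))
        for i in range(len(embedding))
    ]
    return l2_normalize([e + r for e, r in zip(embedding, read)])


def rank_texts(image_emb, text_embs, memory):
    """Rank text indices by cosine similarity of memory-enhanced embeddings.
    Each side touches only the memory, never the other modality, so text
    embeddings could be enhanced and cached offline."""
    query = memory_enhance(image_emb, memory)
    scored = [
        (dot(query, memory_enhance(t, memory)), i)
        for i, t in enumerate(text_embs)
    ]
    return [i for _, i in sorted(scored, reverse=True)]


# Toy usage: two memory slots, one image query, two candidate texts.
memory = [[1.0, 0.0], [0.0, 1.0]]
ranking = rank_texts([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], memory)
```

Because neither modality attends to the other at query time, the retrieval cost in this sketch stays that of an independent-embedding model plus one small attention step over the memory slots per embedding.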

Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval