Abstract: As a bridge between the language and vision domains, cross-modal retrieval between images and texts has been a hot research topic in recent years. It remains challenging because current image representations usually lack the semantic concepts found in the corresponding sentence captions. To address this issue, we introduce an intuitive and interpretable model that learns a common embedding space for aligning images and text descriptions. Specifically, our model first incorporates semantic relationship information into visual and textual features by performing region or word relationship reasoning. It then uses a gate and memory mechanism to perform global semantic reasoning on these relationship-enhanced features, select the discriminative information, and gradually grow a representation of the whole scene. Through alignment learning, the learned visual representations capture the key objects and semantic concepts of a scene, as in the corresponding text caption. Experiments on the MS-COCO [1] and Flickr30K [2] datasets validate that our method surpasses many recent state-of-the-art methods by a clear margin. Besides being effective, our methods are also very efficient at the inference stage. Thanks to the effective overall representation learning with visual semantic reasoning, our methods achieve very strong performance while relying only on a simple inner product to obtain similarity scores between images and captions. Experiments validate that the proposed methods are 30 to 75 times faster than many recent methods with publicly available code. Instead of following the recent trend of using complex local matching strategies [3], [4], [5], [6] that pursue good performance while sacrificing efficiency, we show that a simple global matching strategy can still be very effective and efficient, and can achieve even better performance within our framework.
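
The global matching step described in the abstract, scoring each image against each caption with a simple inner product, can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings below are made-up toy vectors standing in for the outputs of the visual and textual encoders, the function names are hypothetical, and the L2 normalization before the inner product is an assumption rather than something the abstract states.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def inner_product(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def similarity_matrix(image_embs, caption_embs):
    """Entry [i][j] is the inner product of image i and caption j.

    Normalizing first is an assumption of this sketch, not a detail
    taken from the abstract.
    """
    images = [l2_normalize(v) for v in image_embs]
    captions = [l2_normalize(v) for v in caption_embs]
    return [[inner_product(img, cap) for cap in captions] for img in images]

def rank_captions(scores_row):
    """Caption indices ordered from most to least similar."""
    return sorted(range(len(scores_row)), key=lambda j: scores_row[j], reverse=True)

# Toy 3-dimensional embeddings (hypothetical values for illustration only).
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
caption_embs = [[0.9, 0.1, 0.0], [0.0, 2.0, 2.0], [1.0, 1.0, 1.0]]

scores = similarity_matrix(image_embs, caption_embs)
ranking = [rank_captions(row) for row in scores]
print(ranking)  # [[0, 2, 1], [1, 2, 0]]
```

Because each image and each caption is reduced to a single vector, all pairwise scores come from one pass of inner products, with no per-region or per-word matching at query time. That is the kind of global matching the abstract contrasts with local matching strategies.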

Visual and textual fusion for semantically supervised region-based retrieval

Visual & textual fusion for region retrieval: from both fuzzy matching and Bayesian reasoning aspects

Using Visual Dictionary to Associate Semantic Objects in Region-Based Image Retrieval

Semantic Sensitive Region Retrieval Using Keyword-Integrated Bayesian Reasoning

Multi-view and region reasoning semantic enhancement for image-text retrieval

Multimodal Image Retrieval Based on Annotation Keywords and Visual Content

DRM: Dynamic Region Matching for Image Retrieval Using Probabilistic Fuzzy Matching and Boosting Feature Selection

Bi-Directional Image-Text Retrieval with Position Attention and Similarity Filtering

Improving Retrieval Performance by Region Constraints and Relevance Feedback

A unified framework for image retrieval using keyword and visual features

Multiple Level Visual Semantic Fusion Method for Image Re-Ranking

A Novel Framework for Semantic-Based Video Retrieval

Semantic Pre-alignment and Ranking Learning with Unified Framework for Cross-modal Retrieval

A Probabilistic Semantic Model for Image Annotation and Multi-Modal Image Retrieval

Image Retrieval Based on Various Semantic Feature Fusion

Feature First: Advancing Image-Text Retrieval Through Improved Visual Features

An Efficient and Effective Region-Based Image Retrieval Framework

Improving Fusion of Region Features and Grid Features Via Two-Step Interaction for Image-Text Retrieval

From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion

Image-Text Embedding Learning Via Visual and Textual Semantic Reasoning

A novel multi-feature fusion and sparse coding-based framework for image retrieval