Abstract: As a bridge between the language and vision domains, cross-modal retrieval between images and texts has been an active research topic in recent years. It remains challenging because current image representations usually lack the semantic concepts present in the corresponding sentence captions. To address this issue, we introduce an intuitive and interpretable model that learns a common embedding space for aligning images and text descriptions. Specifically, our model first incorporates semantic relationship information into the visual and textual features by performing region or word relationship reasoning. It then uses a gate and memory mechanism to perform global semantic reasoning on these relationship-enhanced features, selecting the discriminative information and gradually growing a representation of the whole scene. Through alignment learning, the learned visual representations capture the key objects and semantic concepts of a scene, as in the corresponding text caption. Experiments on the MS-COCO [1] and Flickr30K [2] datasets validate that our method surpasses many recent state-of-the-art methods by a clear margin. Beyond their effectiveness, our methods are also very efficient at the inference stage. Thanks to the effective overall representation learning with visual semantic reasoning, our methods achieve very strong performance while relying only on a simple inner product to obtain similarity scores between images and captions. Experiments show that the proposed methods are 30 to 75 times faster than many recent methods with publicly available code. Instead of following the recent trend of using complex local matching strategies [3], [4], [5], [6] to pursue good performance at the cost of efficiency, we show that a simple global matching strategy can still be very effective and efficient, and can achieve even better performance when built on our framework.
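The global matching step described in the abstract, scoring each image against each caption with a single inner product, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 3-dimensional embeddings, and the L2 normalization applied before the inner product are all assumptions made for illustration, and the relationship reasoning and gate and memory stages that would produce the embeddings are not shown.

```python
from math import sqrt


def l2_normalize(vec):
    """Scale a vector to unit length (assumption: embeddings are normalized)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def inner_product(a, b):
    """Plain inner product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def similarity_matrix(image_embs, caption_embs):
    """Return scores[i][j] = inner product of image i and caption j."""
    images = [l2_normalize(v) for v in image_embs]
    captions = [l2_normalize(v) for v in caption_embs]
    return [[inner_product(i, c) for c in captions] for i in images]


def rank_captions(scores_row):
    """Caption indices sorted from most to least similar for one image."""
    return sorted(range(len(scores_row)), key=lambda j: scores_row[j], reverse=True)


# Hypothetical global embeddings: one vector per image and per caption.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
caption_embs = [[0.9, 0.1, 0.0], [0.0, 2.0, 2.0], [1.0, 1.0, 1.0]]

scores = similarity_matrix(image_embs, caption_embs)
best_for_each_image = [rank_captions(row)[0] for row in scores]
print(best_for_each_image)
```

Because each image and caption is reduced to one vector, retrieval needs only one inner product per pair, whereas local matching strategies compare many region and word pairs for each image and caption, which is consistent with the efficiency argument made in the abstract.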

Nomic Embed Vision: Expanding the Latent Space

Nomic Embed: Training a Reproducible Long Context Text Embedder

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Embed Everything: A Method for Efficiently Co-Embedding Multi-Modal Spaces

IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images

Learning a Unified Embedding for Visual Search at Pinterest

Deep Visual Semantic Embedding with Text Data Augmentation and Word Embedding Initialization

Vec2Vec: A Compact Neural Network Approach for Transforming Text Embeddings with High Fidelity

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Image-Text Embedding Learning Via Visual and Textual Semantic Reasoning.

Ovis: Structural Embedding Alignment for Multimodal Large Language Model

GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning

Interpreting Embedding Spaces by Conceptualization

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Latent Space Cartography: Visual Analysis of Vector Space Embeddings

From Unimodal to Multimodal: Scaling up Projectors to Align Modalities

NewsEmbed: Modeling News through Pre-trained Document Representations

Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes

Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

Temporal Embeddings: Scalable Self-Supervised Temporal Representation Learning from Spatiotemporal Data for Multimodal Computer Vision