Abstract: The core of cross-modal matching is to accurately measure the similarity between different modalities in a unified representation space. However, a textual description typically reflects a single perspective, whereas the visual modality carries more semantic variations, so images in databases are usually associated with multiple textual captions. Although popular symmetric embedding methods have explored numerous modal interaction approaches, they tend to learn image embeddings that increase the average expression probability of multiple semantic variations. Consequently, the information entropy of the embeddings increases, resulting in redundancy and decreased accuracy. In this work, we propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce this information entropy. Specifically, we obtain a set of heterogeneous visual sub-embeddings through a dynamic orthogonal constraint loss. To encourage the generated candidate embeddings to capture various semantic variations, we construct a mixed distribution and employ a variance-aware weighting loss to assign different weights to the optimization process. In addition, we develop a Fast Re-ranking strategy (FR) to efficiently evaluate the retrieval results and enhance performance. We compare DVSE with existing set-based methods using four image feature encoders and two text feature encoders on three benchmark datasets: MSCOCO, Flickr30K and CUB Captions. We also show the role of different components through ablation studies and perform a sensitivity analysis of the hyperparameters. A qualitative analysis of visualized bidirectional retrieval and attention maps further demonstrates the ability of our method to encode semantic variations.
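To illustrate the idea of pushing a set of visual sub-embeddings toward heterogeneity, the following is a minimal sketch of a generic orthogonality penalty: the mean squared cosine similarity over all pairs of sub-embeddings. It is an assumption made for illustration only and is not the dynamic orthogonal constraint loss of DVSE, whose exact form is not given in the abstract. The names `l2_normalize` and `orthogonality_penalty` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def orthogonality_penalty(sub_embeddings):
    """Mean squared cosine similarity over all distinct pairs of sub-embeddings.

    Hypothetical illustration, not the loss used in the paper.
    A value near 0 means the sub-embeddings are mutually orthogonal;
    a value near 1 means they all point along the same direction.
    """
    unit = [l2_normalize(v) for v in sub_embeddings]
    k = len(unit)
    if k < 2:
        return 0.0  # a single sub-embedding has nothing to be redundant with
    total = 0.0
    pairs = 0
    for i in range(k):
        for j in range(i + 1, k):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += cos * cos
            pairs += 1
    return total / pairs


# Two toy sets of sub-embeddings for one image (values are made up).
orthogonal = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
redundant = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0]]

print(orthogonality_penalty(orthogonal))  # close to 0: no shared direction
print(orthogonality_penalty(redundant))   # close to 1: same direction twice
```

Minimizing a penalty of this kind alongside a retrieval loss discourages sub-embeddings from encoding the same semantic variation, which is the intuition behind using several heterogeneous sub-embeddings instead of a single averaged image embedding.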

Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations

Joint Learning of Attended Zero-Shot Features and Visual-Semantic Mapping

Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models

Multi-view visual semantic embedding for cross-modal image–text retrieval

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Learning Robust Visual-Semantic Embeddings

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Learning semantic sentence representations from visually grounded language without lexical knowledge

Learning Visually-Grounded Semantics from Contrastive Adversarial Samples

Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking

Towards Semantic Embedding In Visual Vocabulary

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Jointly Modeling Embedding and Translation to Bridge Video and Language

Learning Video-Text Aligned Representations for Video Captioning

Universal Multimodal Representation for Language Understanding

VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

Learning Structured Semantic Embeddings for Visual Recognition

Multimodality-guided Visual-Caption Semantic Enhancement

Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

From Captions to Visual Concepts and Back