Abstract: Performing direct matching among different modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to build a uniform model for matching images with all types of text, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneity gap between visual and textual information, it can provide only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which provides heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the two distributions closer together by employing adversarial learning. However, the effectiveness of adversarial learning for image-sentence matching has not been demonstrated, and an effective method is still lacking. Inspired by previous works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of a triplet-loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding into either the image or text modality. In addition, the multi-stage training procedure is carefully designed so that the proposed network not only imposes the image-text similarity constraints given by ground-truth labels, but also enforces the image and text embedding distributions to be similar through adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields a stable accuracy improvement over the baseline model and that our results compare favorably to state-of-the-art methods.
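The two ingredients named in the abstract, a triplet-loss baseline and a modality classifier with an adversarial loss, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, since the abstract does not give the formulas: the cosine similarity, the margin of 0.2, the sum over all in-batch negatives, the linear classifier (`w`, `b`), the weight `lam`, and the choice to express the adversarial term as the negated classifier loss. The multi-stage training schedule is not reproduced.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bidirectional_triplet_loss(img, txt, margin=0.2):
    """Hinge ranking loss; img[i] and txt[i] are a matched pair (assumed)."""
    n = len(img)
    sim = [[cosine(img[i], txt[j]) for j in range(n)] for i in range(n)]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Image i as anchor, sentence j as negative.
            loss += max(0.0, margin - sim[i][i] + sim[i][j])
            # Sentence i as anchor, image j as negative.
            loss += max(0.0, margin - sim[i][i] + sim[j][i])
    return loss

def bce_with_logit(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def modality_classifier_loss(img, txt, w, b):
    """Loss of a hypothetical linear classifier: image = 1, text = 0."""
    def logit(e):
        return sum(wi * xi for wi, xi in zip(w, e)) + b
    losses = [bce_with_logit(logit(e), 1.0) for e in img]
    losses += [bce_with_logit(logit(e), 0.0) for e in txt]
    return sum(losses) / len(losses)

def embedding_objective(img, txt, w, b, lam=0.1):
    """Embedding networks minimize the ranking loss while trying to
    maximize the classifier's loss (one common adversarial form)."""
    return bidirectional_triplet_loss(img, txt) - lam * modality_classifier_loss(img, txt, w, b)
```

In this sketch the classifier would be trained to minimize `modality_classifier_loss`, while the embedding networks would minimize `embedding_objective`. A classifier that cannot tell the modalities apart sits at a loss of log 2, which is the state a modality-invariant embedding pushes toward.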

Uniting Image and Text Deep Networks Via Bi-directional Triplet Loss for Retrieval

Dual-path Convolutional Image-Text Embeddings with Instance Loss

Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval

Cross-Modal Retrieval for Motion and Text via DropTriple Loss

Multi-Modal Retrieval Via Deep Textual-Visual Correlation Learning

Bridging Text and Knowledge with Multi-Prototype Embedding for Few-Shot Relational Triple Extraction

HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval

CMPD: Using Cross Memory Network With Pair Discrimination for Image-Text Retrieval

Enhancing Remote Sensing Image Retrieval with Triplet Deep Metric Learning Network

TextRS: Deep Bidirectional Triplet Network for Matching Text to Remote Sensing Images

A Mutually Textual and Visual Refinement Network for Image-Text Matching

Cross-modal Image-Text Retrieval with Multitask Learning

Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval

Multi-View 3D Object Retrieval with Deep Embedding Network

Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Deep Metric Learning with Hierarchical Triplet Loss

Modality-Invariant Image-Text Embedding for Image-Sentence Matching