Abstract: Performing direct matching among different modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to provide a uniform model for matching images with all types of text, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneity gap between visual and textual information, it can provide only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which provides heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the distributions closer by employing adversarial learning. However, the effectiveness of adversarial learning for image-sentence matching has not been demonstrated, and an effective method is still lacking. Inspired by previous works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of a triplet-loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding as belonging to either the image or the text modality. In addition, the multi-stage training procedure is carefully designed so that the proposed network not only imposes image-text similarity constraints through ground-truth labels, but also forces the image and text embedding distributions to be similar through adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields a stable accuracy improvement over the baseline model and that our results compare favorably with state-of-the-art methods.
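
To make the two loss terms named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes cosine similarity, a bidirectional hinge-style triplet loss over a small batch, and a linear (logistic) modality classifier. The function names, the margin value, the weight `lam`, and the sign-flipped adversarial term are all illustrative assumptions; the abstract does not specify them, nor the details of the multi-stage training schedule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(img_embs, txt_embs, margin=0.2):
    """Bidirectional hinge triplet loss; img_embs[k] is paired with txt_embs[k].

    The margin of 0.2 is an assumed value, not taken from the paper.
    """
    n = len(img_embs)
    loss = 0.0
    for k in range(n):
        pos = cosine(img_embs[k], txt_embs[k])
        for j in range(n):
            if j == k:
                continue
            # Image k as anchor, non-matching sentence j as negative.
            loss += max(0.0, margin - pos + cosine(img_embs[k], txt_embs[j]))
            # Sentence k as anchor, non-matching image j as negative.
            loss += max(0.0, margin - pos + cosine(img_embs[j], txt_embs[k]))
    return loss

def modality_classifier_loss(img_embs, txt_embs, w, b):
    """Mean binary cross-entropy of a linear classifier (image = 1, text = 0)."""
    eps = 1e-12  # avoids log(0)

    def p_image(e):
        z = sum(wi * ei for wi, ei in zip(w, e)) + b
        return 1.0 / (1.0 + math.exp(-z))

    total = 0.0
    for e in img_embs:
        total += -math.log(p_image(e) + eps)
    for e in txt_embs:
        total += -math.log(1.0 - p_image(e) + eps)
    return total / (len(img_embs) + len(txt_embs))

def embedding_objective(img_embs, txt_embs, w, b, margin=0.2, lam=0.1):
    """Assumed objective for the embedding networks.

    The encoders minimise the matching loss while maximising the
    classifier's loss, so that the classifier cannot tell modalities apart.
    """
    match = triplet_loss(img_embs, txt_embs, margin)
    adv = modality_classifier_loss(img_embs, txt_embs, w, b)
    return match - lam * adv

# Toy batch: two perfectly aligned image/sentence pairs in a 2-D space.
images = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[1.0, 0.0], [0.0, 1.0]]
print(triplet_loss(images, sentences))
print(modality_classifier_loss(images, sentences, w=[0.0, 0.0], b=0.0))
```

In this sketch the classifier would be trained to minimise `modality_classifier_loss`, while the embedding networks would be trained on `embedding_objective`; alternating between the two is one common way to realise adversarial training, assumed here rather than taken from the source.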

PTF-SimCM: A Simple Contrastive Model with Polysemous Text Fusion for Visual Similarity Metric

Text-Centric Multimodal Contrastive Learning for Sentiment Analysis

CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection

Cross-modal Semantic Interference Suppression for image-text matching

Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis

Stable Contrastive Learning for Self-Supervised Sentence Embeddings With Pseudo-Siamese Mutual Learning

Modal Contrastive Learning based End-to-End Text Image Machine Translation

Modality-Invariant Image-Text Embedding for Image-Sentence Matching

AsCL: An Asymmetry-sensitive Contrastive Learning Method for Image-Text Retrieval with Cross-Modal Fusion

Regularizing Visual Semantic Embedding with Contrastive Learning for Image-Text Matching

Enhanced Semantic Similarity Learning Framework for Image-Text Matching

SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

CSMF-SPC: Multimodal Sentiment Analysis Model with Effective Context Semantic Modality Fusion and Sentiment Polarity Correction

Image–Text Matching Model Based on CLIP Bimodal Encoding

Semantic Similarity Computing Model Based on Multi Model Fine-Grained Nonlinear Fusion

Understanding Dark Scenes by Contrasting Multi-Modal Observations

Multi-Similarity Contrastive Learning

Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

Semantic Discriminative Metric Learning for Image Similarity Measurement

Cross-modal Image Retrieval with Deep Mutual Information Maximization

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training