Abstract: Zero-Shot Cross-Modal Retrieval (ZS-CMR) is an emerging research hotspot that aims to retrieve data of new classes across different modalities. It is challenging because of both the heterogeneous distributions across modalities and the inconsistent semantics across seen and unseen classes. The handful of recently proposed methods typically borrow ideas from zero-shot learning: they use word embeddings of class labels (class-embeddings) as a common semantic space, and use generative adversarial networks (GANs) to capture the underlying multimodal data structures and to strengthen the relations between input data and the semantic space so as to generalize across seen and unseen classes. In this paper, we propose a novel method termed Learning Cross-Aligned Latent Embeddings (LCALE) as an alternative to these GAN-based methods for ZS-CMR. Rather than using the class-embeddings as the semantic space, our method seeks a shared low-dimensional latent space of input multimodal features and class-embeddings by means of modality-specific variational autoencoders. Notably, we align the distributions learned from multimodal input features and from class-embeddings to construct latent embeddings that contain the essential cross-modal correlation associated with unseen classes. Effective cross-reconstruction and cross-alignment criteria are further developed to preserve class-discriminative information in the latent space, which benefits retrieval efficiency and enables knowledge transfer to unseen classes. We evaluate our model on four benchmark datasets for image-text retrieval and one large-scale dataset for image-sketch retrieval. The experimental results show that our method establishes new state-of-the-art performance for both tasks on all datasets.
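
The abstract names two training criteria, cross-alignment of latent distributions and cross-reconstruction, without giving their exact form. The following is a minimal sketch of what such terms could look like, not the paper's actual losses. It assumes diagonal-Gaussian VAE posteriors, a squared 2-Wasserstein distance for alignment, and an L1 reconstruction error; the function names `gaussian_w2_sq` and `cross_reconstruction_loss` and all array shapes are hypothetical.

```python
import numpy as np

def gaussian_w2_sq(mu_a, logvar_a, mu_b, logvar_b):
    """Squared 2-Wasserstein distance between diagonal Gaussians, per sample.

    Assumption: the alignment criterion is a W2 distance; the abstract
    does not say which distribution distance is used.
    """
    std_a = np.exp(0.5 * logvar_a)
    std_b = np.exp(0.5 * logvar_b)
    mean_term = np.sum((mu_a - mu_b) ** 2, axis=-1)
    std_term = np.sum((std_a - std_b) ** 2, axis=-1)
    return mean_term + std_term

def cross_reconstruction_loss(x_a, x_b, z_a, z_b, decode_a, decode_b):
    """Mean L1 error when each modality is decoded from the other's latent code.

    Assumption: decode_a / decode_b are callables mapping latent codes back
    to the feature space of modality a / b (hypothetical stand-ins for the
    modality-specific VAE decoders).
    """
    err_a = np.abs(x_a - decode_a(z_b)).sum(axis=-1)
    err_b = np.abs(x_b - decode_b(z_a)).sum(axis=-1)
    return float(np.mean(err_a + err_b))

# Toy usage with random posteriors for an "image" and a "text" encoder.
rng = np.random.default_rng(0)
mu_img, logvar_img = rng.normal(size=(4, 8)), np.zeros((4, 8))
mu_txt, logvar_txt = mu_img + 0.1, np.zeros((4, 8))
print(gaussian_w2_sq(mu_img, logvar_img, mu_txt, logvar_txt))
```

In a full model these two terms would be added, with weights, to the usual per-modality VAE objective; the same alignment term could also be applied between each modality's posterior and the posterior obtained from the class-embedding encoder.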

Towards Zero-shot Cross-lingual Image Retrieval and Tagging

Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition

Zero-shot Learning with Regularized Cross-Modality Ranking

Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification

Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset

Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment

Harvesting Deep Models For Cross-Lingual Image Annotation

Saliency-based Multi-View Mixed Language Training for Zero-shot Cross-lingual Classification

ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

Evaluating and explaining training strategies for zero-shot cross-lingual news sentiment analysis

Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval

Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language

A Benchmark of Zero-Shot Cross-Lingual Task-Oriented Dialogue Based on Adversarial Contrastive Representation Learning

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

Zero-Shot Cross-Lingual Knowledge Transfer in VQA Via Multimodal Distillation

Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval

Realistic Zero-Shot Cross-Lingual Transfer in Legal Topic Classification

Synergistic Approach for Simultaneous Optimization of Monolingual, Cross-lingual, and Multilingual Information Retrieval

Adaptive Cross-lingual Text Classification through In-Context One-Shot Demonstrations

Inter-Modality Fusion Based Attention for Zero-Shot Cross-Modal Retrieval