Abstract: The rise of the metaverse and the increasing volume of heterogeneous 2D and 3D data have led to a growing demand for cross-modal retrieval, which allows users to query semantically relevant data across different modalities. Existing methods rely heavily on class labels to bridge semantic correlations, but collecting large-scale, well-labeled data is expensive or even impossible in practice, which makes unsupervised learning more attractive and practical. However, without label information, unsupervised cross-modal learning struggles to bridge semantic correlations across modalities, which inevitably leads to unreliable discrimination. Based on these observations, we reveal and study a novel problem in this paper, namely unsupervised cross-modal learning with noisy pseudo labels. To address this problem, we propose a 2D-3D unsupervised multimodal learning framework that harnesses multimodal data. Our framework consists of three key components: 1) A Self-matching Supervision Mechanism (SSM) warms up the model to encapsulate discrimination into the representations in a self-supervised manner. 2) Robust Discriminative Learning (RDL) further mines discrimination from the imperfect predictions learned during warm-up. To tackle the noise in the predicted pseudo labels, RDL leverages a novel Robust Concentrating Learning Loss (RCLL) that alleviates the influence of uncertain samples, thereby achieving robustness against noisy pseudo labels. 3) A Modality-invariance Learning Mechanism (MLM) minimizes the cross-modal discrepancy so that SSM and RDL produce common representations. We perform comprehensive experiments on four 2D-3D multimodal datasets, comparing our method against 14 state-of-the-art approaches, and the results demonstrate its effectiveness and superiority.
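
To make the roles of the second and third components more concrete, the following is a minimal sketch under stated assumptions. The abstract does not give the form of RCLL or of the MLM objective, so the code uses generic stand-ins: a confidence-weighted cross-entropy on argmax pseudo labels (uncertain samples are dropped) and a mean squared distance between paired 2D and 3D embeddings. The function names, the `threshold` parameter, and both loss forms are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pseudo_label_loss(batch_logits, threshold=0.5):
    """Stand-in for a noise-robust loss (NOT the paper's RCLL).

    Each sample's pseudo label is its argmax class. The cross-entropy
    against that label is weighted by the predicted confidence, and
    samples whose confidence falls below `threshold` get zero weight.
    """
    total, weight_sum = 0.0, 0.0
    for logits in batch_logits:
        conf = max(softmax(logits))          # probability of the pseudo label
        w = conf if conf >= threshold else 0.0
        total += w * -math.log(conf)
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0


def cross_modal_discrepancy(feats_2d, feats_3d):
    """Stand-in for a modality-invariance term (assumed form).

    Mean squared Euclidean distance between paired 2D and 3D embeddings.
    """
    n = len(feats_2d)
    return sum(
        sum((a - b) ** 2 for a, b in zip(u, v))
        for u, v in zip(feats_2d, feats_3d)
    ) / n
```

In this sketch, a sample with near-uniform predictions contributes nothing to the discriminative term, while the discrepancy term pulls the two modalities of the same object toward a common representation.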

Common and Discriminative Semantic Pursuit for Multi-Modal Multi-Label Learning

Common-Individual Semantic Fusion for Multi-View Multi-Label Learning

Dual Enhancement for Multi-Label Learning with Missing Labels

Dual Collaborative Visual-Semantic Mapping for Multi-Label Zero-Shot Image Recognition

Collaboration based multi-modal multi-label learning

Detached and Interactive Multimodal Learning

Comprehensive Semi-Supervised Multi-Modal Learning

Rethinking Modal-oriented Label Correlations for Multi-modal Multi-label Learning

Semi-Supervised Multi-Modal Learning with Incomplete Modalities

Supervised Multi-Modal Fission Learning

Label distribution for multimodal machine learning

Deep dual incomplete multi-view multi-label classification via label semantic-guided contrastive learning

Deep Multimodal Network for Multi-Label Classification

What to align in multimodal contrastive learning?

M3LA: A Novel Approach Based on Encoder-Decoder with Attention Framework for Multi-modal Multi-label Learning

RoMo: Robust Unsupervised Multimodal Learning with Noisy Pseudo Labels

Joint Dictionary Learning and Semantic Constrained Latent Subspace Projection for Cross-Modal Retrieval

Learning Discriminative Representations for Semantic Cross Media Retrieval

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization

Effective Deep Learning-Based Multi-Modal Retrieval