Abstract: The rise of the metaverse and the increasing volume of heterogeneous 2D and 3D data have led to a growing demand for cross-modal retrieval, which allows users to query semantically relevant data across different modalities. Existing methods rely heavily on class labels to bridge semantic correlations, but collecting large-scale well-labeled data is expensive or even impossible in practice, which makes unsupervised learning more attractive and practical. However, without label information, unsupervised cross-modal learning struggles to bridge semantic correlations across modalities, which inevitably leads to unreliable discrimination. Based on these observations, we reveal and study a novel problem in this paper, namely unsupervised cross-modal learning with noisy pseudo labels. To address this problem, we propose a 2D-3D unsupervised multimodal learning framework that harnesses multimodal data. Our framework consists of three key components: 1) A Self-matching Supervision Mechanism (SSM) warms up the model to encapsulate discrimination into the representations in a self-supervised manner. 2) Robust Discriminative Learning (RDL) further mines discrimination from the imperfect predictions learned during warm-up. To tackle the noise in the predicted pseudo labels, RDL leverages a novel Robust Concentrating Learning Loss (RCLL) that alleviates the influence of uncertain samples, making the model robust against noisy pseudo labels. 3) A Modality-invariance Learning Mechanism (MLM) minimizes the cross-modal discrepancy so that SSM and RDL produce common representations. We perform comprehensive experiments on four 2D-3D multimodal datasets, comparing our method against 14 state-of-the-art approaches, thereby demonstrating its effectiveness and superiority.
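The abstract does not give the form of RCLL or of the discrepancy term used by MLM, so the following is only a minimal sketch of the two general ideas it names: down-weighting uncertain samples when training on predicted pseudo labels, and penalizing the distance between paired 2D and 3D representations. The function names, the confidence weighting, the `threshold` parameter, and the squared-distance discrepancy are all assumptions made for illustration; they are generic stand-ins, not the paper's losses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weighted_loss(logits_batch, threshold=0.0):
    """Generic stand-in, NOT the paper's RCLL.

    Each sample's pseudo label is its own argmax prediction. The
    cross-entropy on that label is weighted by the predicted confidence,
    and samples below `threshold` (hypothetical parameter) are ignored,
    so uncertain samples contribute less to the loss.
    """
    total, weight_sum = 0.0, 0.0
    for logits in logits_batch:
        confidence = max(softmax(logits))
        weight = confidence if confidence >= threshold else 0.0
        total += weight * -math.log(confidence)
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else 0.0


def cross_modal_discrepancy(feats_2d, feats_3d):
    """Mean squared distance between paired 2D and 3D embeddings.

    An assumed, simple choice of discrepancy; the abstract only says
    that the cross-modal discrepancy is minimized.
    """
    pairs = list(zip(feats_2d, feats_3d))
    return sum(
        sum((a - b) ** 2 for a, b in zip(u, v)) for u, v in pairs
    ) / len(pairs)
```

In a sketch like this, a confidently predicted sample yields a small loss term while a near-uniform prediction is either down-weighted or dropped, and the discrepancy term is zero only when the paired embeddings from the two modalities coincide.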

Comprehensive Semi-Supervised Multi-Modal Learning

Semi-Supervised Multi-Modal Learning with Incomplete Modalities

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Semi-Supervised Multi-Modal Clustering and Classification with Incomplete Modalities

Multi-Modal Curriculum Learning for Semi-Supervised Image Classification

Multimodal Semi-Supervised Learning for 3D Objects

Common and Discriminative Semantic Pursuit for Multi-Modal Multi-Label Learning

Calibrating Multimodal Learning

Detached and Interactive Multimodal Learning

Multi-Modal Self-Supervised Learning for Recommendation

Collaboration based multi-modal multi-label learning

TCGM: an Information-Theoretic Framework for Semi-Supervised Multi-Modality Learning

Supervised Multi-Modal Fission Learning

On the Causal Sufficiency and Necessity of Multi-Modal Representation Learning

Learning Unseen Modality Interaction

Recent Advances of Multimodal Continual Learning: A Comprehensive Survey

What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?

On Uni-Modal Feature Learning in Supervised Multi-Modal Learning

SMIL: Multimodal Learning with Severely Missing Modality

RoMo: Robust Unsupervised Multimodal Learning with Noisy Pseudo Labels

CaMML: Context-Aware Multimodal Learner for Large Models