Abstract: The rise of the metaverse and the increasing volume of heterogeneous 2D and 3D data have led to a growing demand for cross-modal retrieval, which allows users to query semantically relevant data across different modalities. Existing methods rely heavily on class labels to bridge semantic correlations, but collecting large-scale, well-labeled data is expensive or even impossible in practice, which makes unsupervised learning more attractive and practical. However, without label information, unsupervised cross-modal learning struggles to bridge semantic correlations across modalities, which inevitably leads to unreliable discrimination. Based on these observations, we reveal and study a novel problem in this paper, namely unsupervised cross-modal learning with noisy pseudo labels. To address this problem, we propose a 2D-3D unsupervised multimodal learning framework that harnesses multimodal data. Our framework consists of three key components: 1) The Self-matching Supervision Mechanism (SSM) warms up the model to encapsulate discrimination into the representations in a self-supervised manner. 2) After warm-up, Robust Discriminative Learning (RDL) further mines discrimination from the learned imperfect predictions. To tackle the noise in the predicted pseudo labels, RDL leverages a novel Robust Concentrating Learning Loss (RCLL) to alleviate the influence of uncertain samples, thus achieving robustness against noisy pseudo labels. 3) The Modality-invariance Learning Mechanism (MLM) minimizes the cross-modal discrepancy to enforce SSM and RDL to produce common representations. We perform comprehensive experiments on four 2D-3D multimodal datasets, comparing our method against 14 state-of-the-art approaches, and the results demonstrate its effectiveness and superiority.
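The abstract does not give the form of RCLL or of the MLM objective, so the following is only a minimal sketch of the two general ideas it names: reducing the influence of uncertain pseudo-labeled samples, and penalizing the discrepancy between paired 2D and 3D representations. The function names, the confidence threshold, and the choice of a confidence-weighted cross-entropy and a mean squared distance are all assumptions made for illustration; they are stand-ins, not the paper's actual losses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weighted_loss(logits_batch, threshold=0.5):
    """Hypothetical stand-in for a noise-robust pseudo-label loss (NOT RCLL).

    Each sample's pseudo label is its own argmax prediction. The per-sample
    cross-entropy is weighted by the prediction confidence, and samples whose
    confidence falls below `threshold` are ignored entirely.
    """
    weighted_sum, weight_sum = 0.0, 0.0
    for logits in logits_batch:
        confidence = max(softmax(logits))  # probability of the pseudo label
        weight = confidence if confidence >= threshold else 0.0
        weighted_sum += weight * -math.log(confidence)
        weight_sum += weight
    return weighted_sum / weight_sum if weight_sum > 0 else 0.0


def modality_discrepancy(emb_2d, emb_3d):
    """Hypothetical stand-in for a modality-invariance term (NOT MLM).

    Mean squared Euclidean distance between paired 2D and 3D embeddings.
    """
    total = 0.0
    for a, b in zip(emb_2d, emb_3d):
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / len(emb_2d)


if __name__ == "__main__":
    logits = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0]]  # one confident, one uncertain
    print(confidence_weighted_loss(logits, threshold=0.5))
    print(modality_discrepancy([[1.0, 0.0]], [[0.0, 0.0]]))
```

In this sketch the second sample is dropped because its highest class probability is below the threshold, so only the confident prediction contributes to the loss. A real system would compute both terms on learned image and point-cloud features and combine them with a self-supervised warm-up objective.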

Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning

Multimodal Representation Learning by Alternating Unimodal Adaptation

Multimodal Meta-Learning for Cold-Start Sequential Recommendation

Cold-start active learning for image classification

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

Towards Balanced Active Learning for Multimodal Classification

Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval

RoMo: Robust Unsupervised Multimodal Learning with Noisy Pseudo Labels

Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models

Text-centric Alignment for Multi-Modality Learning

Effective Multimodal Reinforcement Learning with Modality Alignment and Importance Enhancement

CaMML: Context-Aware Multimodal Learner for Large Models

Cross-Modal Data Augmentation for Tasks of Different Modalities

Communication-Efficient Multimodal Federated Learning: Joint Modality and Client Selection

Enhance Modality Robustness in Text-Centric Multimodal Alignment with Adversarial Prompting

Cross-modal contrastive learning for multimodal sentiment recognition

Cross-modality Representation Interactive Learning for Multimodal Sentiment Analysis

Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning

CMCLRec: Cross-modal Contrastive Learning for User Cold-start Sequential Recommendation

Deep Multimodal Learning with Missing Modality: A Survey