Abstract: The rise of the metaverse and the increasing volume of heterogeneous 2D and 3D data have led to a growing demand for cross-modal retrieval, which allows users to query semantically relevant data across different modalities. Existing methods rely heavily on class labels to bridge semantic correlations, but collecting large-scale well-labeled data is expensive or even impossible in practice, which makes unsupervised learning more attractive and practical. However, without label information it is challenging for unsupervised cross-modal learning to bridge semantic correlations across modalities, which inevitably leads to unreliable discrimination. Based on these observations, we reveal and study a novel problem in this paper, namely unsupervised cross-modal learning with noisy pseudo labels. To address this problem, we propose a 2D-3D unsupervised multimodal learning framework that harnesses multimodal data. Our framework consists of three key components: 1) The Self-matching Supervision Mechanism (SSM) warms up the model to encapsulate discrimination into the representations in a self-supervised learning manner. 2) Robust Discriminative Learning (RDL) further mines discrimination from the imperfect predictions learned during warm-up. To tackle the noise in the predicted pseudo labels, RDL leverages a novel Robust Concentrating Learning Loss (RCLL) to alleviate the influence of uncertain samples, thus gaining robustness against noisy pseudo labels. 3) The Modality-invariance Learning Mechanism (MLM) minimizes the cross-modal discrepancy to force SSM and RDL to produce common representations. We perform comprehensive experiments on four 2D-3D multimodal datasets, comparing our method against 14 state-of-the-art approaches, and the results demonstrate its effectiveness and superiority.
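The abstract names two ideas without giving formulas: down-weighting uncertain samples when training on pseudo labels, and penalizing the discrepancy between paired 2D and 3D representations. The sketch below illustrates both in a generic form. It is not the paper's RCLL or MLM, whose definitions are not given here; the function names, the confidence weighting, and the squared-distance discrepancy are all assumptions chosen for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_pseudo_label_loss(logits, threshold=0.0):
    """Cross-entropy against argmax pseudo labels, weighted by confidence.

    Assumption: each sample's weight is its top predicted probability, and
    samples below `threshold` are ignored. This is a stand-in for "alleviating
    the influence of uncertain samples", not the paper's RCLL.
    """
    num, den = 0.0, 0.0
    for row in logits:
        probs = softmax(row)
        conf = max(probs)  # probability assigned to the pseudo label
        weight = conf if conf >= threshold else 0.0
        num += weight * -math.log(conf)
        den += weight
    return num / den if den > 0 else 0.0

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cross_modal_discrepancy(emb_2d, emb_3d):
    """Mean squared distance between L2-normalized paired embeddings.

    Assumption: row i of emb_2d and row i of emb_3d describe the same object.
    """
    total = 0.0
    for a, b in zip(emb_2d, emb_3d):
        a, b = l2_normalize(a), l2_normalize(b)
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / len(emb_2d)
```

In a training loop, the two terms would typically be summed with a trade-off coefficient; how the paper actually combines its components is not stated in the abstract.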

Collaboration based multi-modal multi-label learning

Rethinking Modal-oriented Label Correlations for Multi-modal Multi-label Learning

Dual Enhancement for Multi-Label Learning with Missing Labels

Complex Object Classification

Dual Collaborative Visual-Semantic Mapping for Multi-Label Zero-Shot Image Recognition

Deep Multimodal Network for Multi-Label Classification

Common and Discriminative Semantic Pursuit for Multi-Modal Multi-Label Learning

Partial Multi-label Learning with Label and Feature Collaboration

Semi-Supervised Multi-Modal Multi-Instance Multi-Label Deep Network with Optimal Transport

Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

Learn to Combine Modalities in Multimodal Deep Learning

Comprehensive Semi-Supervised Multi-Modal Learning

M3LA: A Novel Approach Based on Encoder-Decoder with Attention Framework for Multi-modal Multi-label Learning

Multi-Modal Multi-Instance Multi-Label Learning with Graph Convolutional Network

Multi-instance multi-label new label learning

Multi-Label Image Recognition with Joint Class-Aware Map Disentangling and Label Correlation Embedding

Multi-Modal Image Annotation with Multi-Instance Multi-Label LDA

RoMo: Robust Unsupervised Multimodal Learning with Noisy Pseudo Labels

Meta Multi-Instance Multi-Label learning by heterogeneous network fusion

Transformer-based Multi-Modal Learning for Multi Label Remote Sensing Image Classification

Supervised Multi-Modal Fission Learning