Abstract: The rise of the metaverse and the increasing volume of heterogeneous 2D and 3D data have led to a growing demand for cross-modal retrieval, which allows users to query semantically relevant data across different modalities. Existing methods rely heavily on class labels to bridge semantic correlations, but collecting large-scale well-labeled data is expensive or even impossible in practice, which makes unsupervised learning more attractive and practical. However, bridging semantic correlations across modalities is challenging in unsupervised cross-modal learning because label information is missing, which inevitably leads to unreliable discrimination. Based on these observations, we reveal and study a novel problem in this paper, namely unsupervised cross-modal learning with noisy pseudo labels. To address this problem, we propose a 2D-3D unsupervised multimodal learning framework that harnesses multimodal data. Our framework consists of three key components: 1) A Self-matching Supervision Mechanism (SSM) warms up the model to encapsulate discrimination into the representations in a self-supervised manner. 2) Robust Discriminative Learning (RDL) further mines discrimination from the imperfect predictions learned during warm-up. To tackle the noise in the predicted pseudo labels, RDL leverages a novel Robust Concentrating Learning Loss (RCLL) to alleviate the influence of uncertain samples, thus achieving robustness against noisy pseudo labels. 3) A Modality-invariance Learning Mechanism (MLM) minimizes the cross-modal discrepancy to force SSM and RDL to produce common representations. We perform comprehensive experiments on four 2D-3D multimodal datasets, comparing our method against 14 state-of-the-art approaches, and the results demonstrate its effectiveness and superiority.
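The abstract does not give the form of RCLL or MLM, so the following is only a minimal sketch of the general idea under stated assumptions: pseudo labels predicted from one modality supervise the other, low-confidence samples are down-weighted, and paired 2D and 3D embeddings are pulled together. The confidence-thresholded cross-entropy is a simple stand-in, not the paper's RCLL, and all function names, the `threshold` parameter, and the squared-distance gap are hypothetical.

```python
import math

def pseudo_labels(probs):
    """Return a (label, confidence) pair per sample from class probabilities."""
    out = []
    for p in probs:
        label = max(range(len(p)), key=lambda k: p[k])
        out.append((label, p[label]))
    return out

def confidence_weighted_ce(probs, labels_conf, threshold=0.6):
    """Cross-entropy against pseudo labels, weighted by confidence.

    Samples whose pseudo-label confidence is below `threshold` get weight 0.
    This is an assumed stand-in for a noise-robust loss, not RCLL itself.
    """
    total, weight_sum = 0.0, 0.0
    for p, (label, conf) in zip(probs, labels_conf):
        w = conf if conf >= threshold else 0.0
        total += -w * math.log(max(p[label], 1e-12))
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0

def modality_gap(emb_2d, emb_3d):
    """Mean squared Euclidean distance between paired 2D and 3D embeddings."""
    n = len(emb_2d)
    return sum(
        sum((a - b) ** 2 for a, b in zip(u, v))
        for u, v in zip(emb_2d, emb_3d)
    ) / n

# Toy usage: 2D predictions provide pseudo labels for the 3D branch.
probs_2d = [[0.9, 0.1], [0.55, 0.45]]
probs_3d = [[0.8, 0.2], [0.3, 0.7]]
labels = pseudo_labels(probs_2d)
robust_loss = confidence_weighted_ce(probs_3d, labels)
gap = modality_gap([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.0], [0.0, 1.0]])
```

In this toy example the second sample's pseudo label has confidence 0.55, below the assumed threshold, so it does not contribute to the loss. A real system would combine such terms with a self-supervised warm-up objective and optimize them with a deep learning framework.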

Discovering and Distinguishing Multiple Visual Senses for Web Learning

Extracting Multiple Visual Senses for Web Learning

Discovering and Distinguishing Multiple Visual Senses for Polysemous Words

Semantic Image Retrieval Based on Multiple-Instance Learning

Exploiting textual queries for dynamically visual disambiguation

Multimodal Learning for Multi-Label Image Classification

Dynamically Visual Disambiguation of Keyword-based Image Search

Poster: Cross Labelling and Learning Unknown Activities Among Multimodal Sensing Data

Semantic-Guided Representation Enhancement for Multi-Label Image Classification

Image classification by multimodal subspace learning

RoMo: Robust Unsupervised Multimodal Learning with Noisy Pseudo Labels

Towards Multi-Semantic Image Annotation with Graph Regularized Exclusive Group Lasso

Multi-modal Learning for Social Image Classification

Joint image and word sense discrimination for image retrieval

Graph-based Multimodal Semi-Supervised Image Classification

Learning multi-label scene classification

Effective Multi-Modal Multi-Label Learning for Automatic Image Annotation

Vision+X: A Survey on Multimodal Learning in the Light of Data

On the Sampling of Web Images for Learning Visual Concept Classifiers

Multimodal visual dictionary learning via heterogeneous latent semantic sparse coding

Information Symmetry Matters: A Modal-Alternating Propagation Network for Few-Shot Learning