Abstract:Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval. Project website: <a class="link-external link-https" href="https://xiaobai1217.github.io/Unseen-Modality-Interaction/" rel="external noopener nofollow">this https URL</a>.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve? This paper aims to solve the generalization problem of **unseen modality combinations** in multimodal learning. Traditional multimodal learning methods assume that all modality combinations of interest are available during the training process in order to learn cross - modality correspondences. However, in the real world, this assumption is not always valid because it may be infeasible to collect sufficient training samples containing all sensor data. In addition, sensors may be added or removed at any time, resulting in inconsistent modality combinations during training and inference. To solve this problem, the author proposes a new research topic - **Unseen Modality Interaction**, that is, how to effectively predict these unseen modality combinations during inference when some modality combinations have not been seen during training. Specifically, the author proposes a method to achieve this by projecting the multi - dimensional features of different modalities into a common space and retaining as much discriminative information as possible. In addition, in order to reduce the over - fitting of the model to unreliable modality combinations during the training process, the author introduces a pseudo - supervision mechanism to indicate the reliability of each modality prediction. ### Main contributions 1. **Define and solve the Unseen Modality Interaction problem**: This is a new challenge in the field of multimodal learning, that is, how to effectively predict these unseen modality combinations during inference when some modality combinations have not been seen during training. 2. **Propose a new model architecture**: Map the features of different modalities to a common space through the feature projection module, and use a two - branch prediction mechanism to reduce over - fitting, thereby improving the generalization ability of the model for unseen modality combinations. 3. **Verify the effectiveness of the method**: Through experiments on multiple tasks (such as multimodal video classification, robot state regression, and multimedia retrieval), the effectiveness and robustness of this method are proved. ### Formula representation - **Feature projection module**: - Input: Features of different modalities \( F_m=[f_1,\cdots,f_{k_m}] \), where \( F_m\in\mathbb{R}^{k_m\times d_m} \) - Output: Features in the common space \( \hat{F}_m = F_m^T O_m \), where \( \hat{F}_m\in\mathbb{R}^{k^*\times d^*} \) - Attention matrix \( O_m = [o_1,\cdots,o_{k_m}] \), where \( O_m\in\mathbb{R}^{k_m\times k^*} \) - **Feature alignment loss**: - For each modality \( m \), minimize the Euclidean distance between the average feature \( \bar{f}_m=\frac{1}{k^*}\sum_i\hat{f}_{m,i} \) and the learnable token \( u_{n_m} \): \[ L_{\text{align}}=\sum_{m\in M_1}\|\bar{f}_m - u_{n_m}\|_2^2 \] - **Two - branch prediction**: - Loss function: \[ L=\lambda L_{\text{align}}+L_{\text{supervised}}+\alpha L_{\text{pseudo}} \] - Where \( \lambda \) and \( \alpha \) are hyperparameters used to balance the weights of different loss terms. Through these methods, the author successfully solves the generalization problem of unseen modality combinations in multimodal learning and shows its effectiveness on multiple tasks.

Learning Unseen Modality Interaction

Multimodal Representation Learning by Alternating Unimodal Adaptation

A Theory of Multimodal Learning

Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications

Achieving Cross Modal Generalization with Multimodal Unified Representation.

Rethinking Missing Modality Learning: From a Decoding View

One-stage Modality Distillation for Incomplete Multimodal Learning

Detached and Interactive Multimodal Learning

Incomplete Multimodal Learning for Remote Sensing Data Fusion

UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning

Everything is a Video: Unifying Modalities through Next-Frame Prediction

Learn to Combine Modalities in Multimodal Deep Learning

Rethinking Missing Modality Learning from a Decoding Perspective

Robust Multimodal Representation under Uncertain Missing Modalities

Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach

Comprehensive Semi-Supervised Multi-Modal Learning.

Privileged Modality Learning Via Multimodal Hallucination

RoMo: Robust Unsupervised Multimodal Learning with Noisy Pseudo Labels

Learning from the Global View: Supervised Contrastive Learning of Multimodal Representation

Learning Cross-Modality Interaction for Robust Depth Perception of Autonomous Driving

UCSL: Toward Unsupervised Common Subspace Learning for Cross-Modal Image Classification