Learning Unseen Modality Interaction

Yunhua Zhang, Hazel Doughty, Cees G.M. Snoek
2023-10-25
Abstract: Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval. Project website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/
Computer Vision and Pattern Recognition, Machine Learning, Multimedia
### What problem does this paper attempt to solve?

This paper addresses generalization to **unseen modality combinations** in multimodal learning. Conventional multimodal methods assume that every modality combination of interest is available during training, so that cross-modal correspondences can be learned. In practice this assumption often fails: collecting enough training samples that contain all sensor data may be infeasible, and sensors may be added or removed at any time, so the modality combinations seen during training and inference can differ.

To address this, the authors pose a new problem, **Unseen Modality Interaction**: how to make predictions at inference time for modality combinations that never appeared during training. Their method projects the multidimensional features of different modalities into a common space while retaining as much discriminative information as possible, so that information can be accumulated across the available modalities by simple summation. To reduce overfitting to unreliable modality combinations during training, they add a pseudo-supervision mechanism that indicates the reliability of each modality's prediction.

### Main contributions

1. **Define and address the Unseen Modality Interaction problem**: A new challenge in multimodal learning, in which some modality combinations required at inference are never seen during training.
2. **Propose a new model architecture**: A feature projection module maps the features of different modalities into a common space, and a two-branch prediction mechanism reduces overfitting, improving generalization to unseen modality combinations.
3. **Verify the effectiveness of the method**: Experiments on multiple tasks (multimodal video classification, robot state regression, and multimedia retrieval) demonstrate the effectiveness and robustness of the method.

### Formula representation

- **Feature projection module**:
  - Input: features of modality \( m \), \( F_m=[f_1,\cdots,f_{k_m}] \), where \( F_m\in\mathbb{R}^{k_m\times d_m} \)
  - Output: features in the common space \( \hat{F}_m = O_m^T F_m \), where \( \hat{F}_m\in\mathbb{R}^{k^*\times d^*} \) (the stated shape requires the feature dimension to have been mapped from \( d_m \) to \( d^* \) beforehand)
  - Attention matrix: \( O_m = [o_1,\cdots,o_{k_m}] \), where \( O_m\in\mathbb{R}^{k_m\times k^*} \)
- **Feature alignment loss**:
  - For each modality \( m \), minimize the squared Euclidean distance between the average feature \( \bar{f}_m=\frac{1}{k^*}\sum_i\hat{f}_{m,i} \) and the learnable token \( u_{n_m} \):
    \[ L_{\text{align}}=\sum_{m\in M_1}\|\bar{f}_m - u_{n_m}\|_2^2 \]
- **Two-branch prediction**:
  - Loss function:
    \[ L=\lambda L_{\text{align}}+L_{\text{supervised}}+\alpha L_{\text{pseudo}} \]
  - Here \( \lambda \) and \( \alpha \) are hyperparameters that balance the loss terms.

Together, these components target generalization to unseen modality combinations in multimodal learning, and the authors report that the approach is effective across multiple tasks.
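The formulas above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the linear maps `W_feat` and `W_attn`, the softmax used to form \( O_m \), the use of one learnable token per modality, and all sizes and names are assumptions made only to show how the shapes and loss terms fit together.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project(F_m, W_feat, W_attn):
    """Map (k_m, d_m) modality features to a common (k_star, d_star) shape.

    W_feat (d_m, d_star) and W_attn (d_m, k_star) are hypothetical
    parameters; the summary does not say how O_m is computed.
    """
    H = F_m @ W_feat                     # (k_m, d_star)
    O_m = softmax(F_m @ W_attn, axis=0)  # (k_m, k_star), normalized over input tokens
    return O_m.T @ H                     # (k_star, d_star)

def fuse(projected):
    # Accumulate whichever modalities are available by plain summation.
    return sum(projected)

def align_loss(projected, tokens):
    # Sum over modalities of ||mean projected feature - token||^2.
    # Assumption: one token per modality, keyed by modality name.
    return sum(
        float(np.sum((projected[m].mean(axis=0) - tokens[m]) ** 2))
        for m in projected
    )

def total_loss(l_align, l_supervised, l_pseudo, lam, alpha):
    # L = lambda * L_align + L_supervised + alpha * L_pseudo
    return lam * l_align + l_supervised + alpha * l_pseudo

# Toy setup with made-up sizes: two modalities with different token counts and widths.
rng = np.random.default_rng(0)
k_star, d_star = 4, 6
shapes = {"video": (8, 16), "audio": (5, 12)}
feats = {m: rng.normal(size=s) for m, s in shapes.items()}
params = {
    m: (rng.normal(size=(s[1], d_star)), rng.normal(size=(s[1], k_star)))
    for m, s in shapes.items()
}
tokens = {m: rng.normal(size=d_star) for m in shapes}

projected = {m: project(feats[m], *params[m]) for m in shapes}
fused_both = fuse(projected.values())       # both modalities available
fused_audio = fuse([projected["audio"]])    # only one modality available
print(fused_both.shape, fused_audio.shape)  # (4, 6) (4, 6)
```

Because every modality is projected to the same \( k^* \times d^* \) shape, the fused representation has the same size for any subset of modalities, which is what allows a combination absent from training to be summed at inference.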