Abstract: Using multiple input streams simultaneously to train multimodal neural networks is intuitively advantageous but practically challenging. A key challenge is unimodal bias, where a network overly relies on one modality and ignores others during joint training. We develop a theory of unimodal bias with multimodal deep linear networks to understand how architecture and data statistics influence this bias. This is the first work to calculate the duration of the unimodal phase in learning as a function of the depth at which modalities are fused within the network, dataset statistics, and initialization. We show that the deeper the layer at which fusion occurs, the longer the unimodal phase. A long unimodal phase can lead to a generalization deficit and permanent unimodal bias in the overparametrized regime. Our results, derived for multimodal linear networks, extend to nonlinear networks in certain settings. Taken together, this work illuminates pathologies of multimodal learning under joint training, showing that late and intermediate fusion architectures can give rise to long unimodal phases and permanent unimodal bias. Our code is available at: https://yedizhang.github.io/unimodal-bias.html

What problem does this paper attempt to address?

This paper addresses unimodal bias, which arises when multimodal networks are trained jointly, and studies it in multimodal deep linear networks. When a neural network is trained on multimodal data, it may rely too heavily on one modality and ignore the others. This bias can hurt model performance, especially in the overparameterized regime, where a long unimodal phase can cause a generalization deficit and permanent unimodal bias.

### Main contributions of the paper

1. **Theoretical explanation**: The authors explain why pronounced unimodal bias arises in linear networks with late and intermediate fusion but is not apparent in linear networks with early fusion.
2. **Duration of the unimodal phase**: The authors calculate the duration of the unimodal phase in linear networks with late and intermediate fusion as a function of network configuration, dataset correlation matrices, and initialization scale.
3. **Error attribution and superficial modality preference**: The authors analyze error attribution and superficial modality preference during the unimodal phase.
4. **Generalization deficit and permanent unimodal bias**: The authors show how a long unimodal phase can lead to a generalization deficit and permanent unimodal bias in the overparameterized regime.
5. **Numerical simulation verification**: The authors verify their theoretical findings with numerical simulations of multimodal deep linear networks and some nonlinear networks.
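The kind of simulation mentioned in the last contribution can be illustrated with a very small experiment. The following is a minimal sketch, not the authors' code: it assumes a two-layer late-fusion linear network with one scalar input per modality, trains it by gradient descent on the population mean-squared error, and uses made-up statistics (modality A has variance 4, modality B has variance 1, the two are uncorrelated, and the target is `x_A + x_B`). The function name `train_late_fusion` and all hyperparameters are hypothetical choices for illustration.

```python
import numpy as np

def train_late_fusion(sigma, sigma_yx, u0=0.01, lr=0.01, steps=3000):
    """Gradient descent on the population MSE of a two-layer late-fusion linear net.

    Assumed model: y_hat = a2 * a1 * x_A + b2 * b1 * x_B with scalar x_A, x_B.
    Returns the trajectory of the effective weights (a1*a2, b1*b2), one row per step.
    """
    sigma = np.asarray(sigma, dtype=float)        # input correlation matrix (2 x 2)
    sigma_yx = np.asarray(sigma_yx, dtype=float)  # input-output correlations (length 2)
    a1 = a2 = b1 = b2 = u0                        # small, balanced initialization
    history = np.empty((steps, 2))
    for t in range(steps):
        w = np.array([a1 * a2, b1 * b2])          # effective weight on each modality
        history[t] = w
        # Negative gradient of 0.5 * E[(y - w x)^2] with respect to w.
        err = sigma_yx - w @ sigma
        # Chain rule: each layer's update is scaled by the other layer's weight.
        a1, a2 = a1 + lr * a2 * err[0], a2 + lr * a1 * err[0]
        b1, b2 = b1 + lr * b2 * err[1], b2 + lr * b1 * err[1]
    return history

# Illustrative statistics (assumptions): var(x_A) = 4, var(x_B) = 1, y = x_A + x_B.
sigma = [[4.0, 0.0], [0.0, 1.0]]
sigma_yx = [4.0, 1.0]
history = train_late_fusion(sigma, sigma_yx)

# First step at which each modality reaches half of its final weight (1.0 here).
step_a = int(np.argmax(history[:, 0] >= 0.5))
step_b = int(np.argmax(history[:, 1] >= 0.5))
print("modality A half-learned at step", step_a)
print("modality B half-learned at step", step_b)
print("weight on B when A is half-learned:", history[step_a, 1])
```

In this toy setting the weight on modality B is still close to zero when modality A is already half-learned, which is the unimodal phase described above. Both weights eventually converge here because the sketch uses population statistics; it does not reproduce the overparameterized setting in which the bias can become permanent.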
### Key concepts and formulas

- **Input correlation matrix** \(\Sigma\) and **input-output correlation matrix** \(\Sigma_{yx}\):

  \[ \Sigma=\begin{pmatrix} \Sigma_A&\Sigma_{AB}\\ \Sigma_{BA}&\Sigma_B \end{pmatrix}=\begin{pmatrix} \langle x_Ax_A^{\top}\rangle&\langle x_Ax_B^{\top}\rangle\\ \langle x_Bx_A^{\top}\rangle&\langle x_Bx_B^{\top}\rangle \end{pmatrix} \]

  \[ \Sigma_{yx}=\begin{pmatrix} \Sigma_{yxA}&\Sigma_{yxB} \end{pmatrix}=\begin{pmatrix} \langle yx_A^{\top}\rangle&\langle yx_B^{\top}\rangle \end{pmatrix} \]

- **Duration of the unimodal phase**:

  \[ t_A = \tau\|\Sigma_{yxA}\|^{-1}\ln\frac{1}{u_0} \]

  \[ t_B=t_A+\tau\,\frac{1 - \|\Sigma_{yxA}\|^{-1}\|\Sigma_{yxB}\|}{\left\| \Sigma_{yxB}-\Sigma_{yxA}\Sigma_A^{-1}\Sigma_{AB} \right\|}\ln\frac{1}{u_0} \]

  \[ \frac{t_B}{t_A}=1+\frac{\|\Sigma_{yxA}\| - \|\Sigma_{yxB}\|}{\left\| \Sigma_{yxB}-\Sigma_{yxA}\Sigma_A^{-1}\Sigma_{AB} \right\|} \]

- **Time ratio**: \[ \frac{t
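
The expressions for \(t_A\), \(t_B\), and \(t_B/t_A\) can be evaluated directly from the correlation matrices. The sketch below is a hedged illustration rather than the authors' implementation. It assumes that modality A is the one learned first (so \(\|\Sigma_{yxA}\| > \|\Sigma_{yxB}\|\)), that \(u_0\) is the initialization scale and \(\tau\) a time constant, and that the first `dim_a` input coordinates belong to modality A. The helper name `unimodal_phase_times` and the example numbers are hypothetical.

```python
import numpy as np

def unimodal_phase_times(sigma, sigma_yx, dim_a, u0=1e-3, tau=1.0):
    """Evaluate t_A, t_B and t_B / t_A from the correlation matrices.

    sigma    : (d, d) input correlation matrix, modality A in the first dim_a coordinates
    sigma_yx : length-d input-output correlation vector (scalar output assumed)
    """
    sigma = np.asarray(sigma, dtype=float)
    sigma_yx = np.asarray(sigma_yx, dtype=float).reshape(1, -1)

    s_a = sigma[:dim_a, :dim_a]    # Sigma_A
    s_ab = sigma[:dim_a, dim_a:]   # Sigma_AB
    s_yxa = sigma_yx[:, :dim_a]    # Sigma_yxA
    s_yxb = sigma_yx[:, dim_a:]    # Sigma_yxB

    norm_a = np.linalg.norm(s_yxa)
    norm_b = np.linalg.norm(s_yxb)
    # Norm of Sigma_yxB - Sigma_yxA Sigma_A^{-1} Sigma_AB (solve avoids an explicit inverse).
    residual = np.linalg.norm(s_yxb - s_yxa @ np.linalg.solve(s_a, s_ab))

    log_term = np.log(1.0 / u0)
    t_a = tau * log_term / norm_a
    t_b = t_a + tau * (1.0 - norm_b / norm_a) / residual * log_term
    return t_a, t_b, t_b / t_a

# Illustrative statistics (assumptions): two uncorrelated scalar modalities.
t_a, t_b, ratio = unimodal_phase_times(
    sigma=[[4.0, 0.0], [0.0, 1.0]], sigma_yx=[4.0, 1.0], dim_a=1, u0=0.01
)
print("t_A =", t_a, "t_B =", t_b, "t_B / t_A =", ratio)
```

The ratio returned by this helper does not depend on `u0` or `tau`, which matches the closed-form expression for \(t_B/t_A\): both factors cancel when \(t_B\) is divided by \(t_A\).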

Understanding Unimodal Bias in Multimodal Deep Linear Networks
