A Theory of Multimodal Learning

Zhou Lu
2023-12-16
Abstract: Human perception of the empirical world involves recognizing the diverse appearances, or 'modalities', of underlying objects. Despite the longstanding consideration of this perspective in philosophy and cognitive science, the study of multimodality remains relatively under-explored within the field of machine learning. Nevertheless, current studies of multimodal machine learning are limited to empirical practices, lacking theoretical foundations beyond heuristic arguments. An intriguing finding from the practice of multimodal learning is that a model trained on multiple modalities can outperform a finely-tuned unimodal model, even on unimodal tasks. This paper provides a theoretical framework that explains this phenomenon, by studying generalization properties of multimodal learning algorithms. We demonstrate that multimodal learning allows for a superior generalization bound compared to unimodal learning, up to a factor of $O(\sqrt{n})$, where $n$ represents the sample size. Such advantage occurs when both connection and heterogeneity exist between the modalities.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the theoretical advantages of multimodal learning and the mechanisms behind them. Specifically:

1. **Lack of Theoretical Foundation**: Multimodal machine learning has achieved significant practical success (for example, Gato and GPT-4), but its theoretical foundation remains weak and rests mostly on heuristic arguments rather than rigorous mathematical proof.
2. **Explanation of Multimodal Advantages**: In certain tasks, a multimodal model can outperform a fine-tuned unimodal model even on unimodal data. The paper explains this phenomenon by studying the generalization properties of multimodal learning algorithms.
3. **Connectivity and Heterogeneity**: The paper proposes a theoretical framework showing that multimodal learning admits a better generalization bound than unimodal learning, by up to a factor of $O(\sqrt{n})$ in the sample size $n$, when both connectivity and heterogeneity are present. Connectivity refers to the mapping relationships between modalities; heterogeneity refers to the differences and complementarities between them.
4. **Semi-Supervised Multi-Task Learning**: The paper also explores the effectiveness of multimodal learning in semi-supervised multi-task learning scenarios, particularly how a large amount of unlabeled data can be used to learn the connections between modalities, and demonstrates the advantages of this approach through experiments.

In summary, the paper aims to fill the gap in theoretical research on multimodal learning and to show, through rigorous mathematical analysis, why multimodal learning can outperform unimodal learning in certain situations.
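The two-stage idea in point 4 (learn the connection between modalities from unlabeled pairs, then learn a predictor from a few labeled examples) can be illustrated with a toy sketch. This is not the paper's algorithm or analysis. The synthetic data model (`y ≈ x**2`, `z ≈ x + 3*y`), the hypothesis classes, the sample sizes, and the names `fit_connection` and `fit_predictor` are all assumptions made here purely for illustration.

```python
import random

def fit_connection(pairs):
    """Stage 1 (assumed hypothesis class): least-squares fit of y ~ c * x**2
    from unlabeled (x, y) pairs of the two modalities."""
    num = sum((x * x) * y for x, y in pairs)
    den = sum((x * x) ** 2 for x, _ in pairs)
    return num / den

def fit_predictor(triples):
    """Stage 2 (assumed hypothesis class): least-squares fit of
    z ~ w1 * x + w2 * y, solving the 2x2 normal equations by Cramer's rule."""
    sxx = sum(x * x for x, _, _ in triples)
    sxy = sum(x * y for x, y, _ in triples)
    syy = sum(y * y for _, y, _ in triples)
    sxz = sum(x * z for x, _, z in triples)
    syz = sum(y * z for _, y, z in triples)
    det = sxx * syy - sxy * sxy
    w1 = (sxz * syy - sxy * syz) / det
    w2 = (sxx * syz - sxy * sxz) / det
    return w1, w2

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def sample():
    """Synthetic toy data: modality x, modality y connected to x, label z."""
    x = rng.uniform(-1.0, 1.0)
    y = x * x + rng.gauss(0.0, 0.01)            # connection between modalities
    z = x + 3.0 * y + rng.gauss(0.0, 0.01)      # label depends on both modalities
    return x, y, z

# Many unlabeled modality pairs, few labeled examples (hypothetical sizes).
unlabeled = [sample()[:2] for _ in range(2000)]
labeled = [sample() for _ in range(20)]
test_set = [sample() for _ in range(500)]

c = fit_connection(unlabeled)
w1, w2 = fit_predictor(labeled)

# Unimodal test time: only x is observed, so y is imputed as c * x**2.
mse = sum((w1 * x + w2 * (c * x * x) - z) ** 2 for x, _, z in test_set) / len(test_set)
```

In this toy setting the connection is estimated from the large unlabeled pool, so the small labeled set is needed only to fit the two predictor weights. The sketch shows the mechanics of the two stages and makes no claim about the generalization bounds proved in the paper.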