Abstract: Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining three key principles of modality heterogeneity, connections, and interactions that have driven subsequent innovations, and propose a taxonomy of six core technical challenges: representation, alignment, reasoning, generation, transference, and quantification, covering historical and recent trends. Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.

What problem does this paper attempt to address?

This paper surveys the core theoretical and technical challenges of multimodal machine learning and proposes a taxonomy that describes them systematically. Specifically, it addresses the following:

1. **Basic Principles of Multimodal Machine Learning**:
   - **Heterogeneity**: Different modalities contain information that varies in quality, structure, and representation.
   - **Connections**: Modalities are often related and carry complementary information.
   - **Interactions**: Modality elements interact during task inference to give rise to new information.

2. **Six Core Technical Challenges**:
   - **Representation**: How to learn representations that reflect the heterogeneity and interconnections of modality elements.
   - **Alignment**: How to identify the connections and interactions between modality elements.
   - **Reasoning**: How to combine cross-modal evidence for reasoning.
   - **Generation**: How to learn generative processes that produce raw modality data reflecting cross-modal interactions.
   - **Transference**: How to transfer knowledge between modalities, especially from resource-rich to resource-poor modalities.
   - **Quantification**: How to better understand modality heterogeneity, interconnections, and the learning process through empirical and theoretical research.

3. **Details of Specific Challenges**:
   - **Representation**: Includes representation fusion, representation coordination, and representation fission.
   - **Alignment**: Includes discrete alignment, continuous alignment, and contextualized representations.
   - **Reasoning**: Includes structure modeling, intermediate concepts, inference paradigms, and external knowledge.
   - **Generation**: Includes summarization, translation, and creation.
   - **Transference**: Includes cross-modal transfer, co-learning, and model induction.
   - **Quantification**: Includes research on heterogeneity, interconnections, and the learning process.

The paper aims to give readers a comprehensive overview of multimodal machine learning by synthesizing a broad body of multimodal research and identifying key open problems for future work.

Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Multimodal Machine Learning: A Survey and Taxonomy

Foundations of Multisensory Artificial Intelligence

A Theory of Multimodal Learning

Vision+X: A Survey on Multimodal Learning in the Light of Data

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review

Modality Influence in Multimodal Machine Learning

What is Multimodality?

Recent Advances and Trends in Multimodal Deep Learning: A Review

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

A Systematic Literature Review on Multimodal Machine Learning: Applications, Challenges, Gaps and Future Directions

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

New Ideas and Trends in Deep Multimodal Content Understanding: A Review

Multimodal Systems: Taxonomy, Methods, and Challenges

Continual Learning Meets Multimodal Foundation Models: Fundamentals and Advances

Recent Advances of Multimodal Continual Learning: A Comprehensive Survey

Cross-Modal Knowledge Discovery, Inference, and Challenges