Abstract: We perceive and communicate with the world in a multisensory manner, where different information sources are processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow machines with true intelligence, multimodal machine learning that incorporates data from various sources has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective, considering not only the purely technical aspects but also the intrinsic nature of different data modalities. We analyze the commonness and uniqueness of each data format, mainly spanning vision, audio, text, and motion, and then present the methodological advancements categorized by the combination of data modalities, such as Vision+Text, with a slight emphasis on visual data. We investigate the existing literature on multimodal learning at both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g., the semantic consistency between image objects and textual descriptions, and the rhythm correspondence between video dance moves and musical beats. We hope that exploiting the alignment, as well as the existing gap, between the intrinsic nature of each data modality and the technical designs will help future research better address the specific challenges of concrete multimodal tasks, prompting a unified multimodal machine learning framework closer to a real human intelligence system.

What problem does this paper attempt to address?

The main purpose of this paper is to provide a comprehensive review of the multimodal machine learning field and to explore, from the perspective of data characteristics, the connection between the intrinsic properties of different data modalities and technical design. Specifically, the paper addresses the following core issues:

1. **Analysis of the characteristics and commonalities of different data modalities**: The paper provides a detailed analysis of the characteristics of visual (images and videos), audio (music, speech, environmental sounds), and textual data, and discusses the commonalities and uniqueness among these modalities.
2. **Multimodal representation learning**: The paper discusses various approaches to multimodal representation learning, including learning strategies under supervised and unsupervised settings, and introduces commonly used network architectures for processing various modalities, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers.
3. **Classification of downstream applications**: The paper categorizes multimodal applications into discriminative and generative types, and lists a series of specific task scenarios for each. For example, applications of vision + text include image caption generation, dialogue generation, and text-based image synthesis; applications of vision + audio cover audio-visual event localization, audio-video parsing, and more.
4. **Technical connections to data characteristics**: The paper particularly emphasizes the importance of considering the intrinsic properties of different data modalities in technical design. By comparing technical designs against data characteristics, the authors point out how to better utilize these properties to address specific challenges and to advance the development of multimodal machine learning frameworks, bringing them closer to true artificial intelligence systems.
5. **Future research directions**: Finally, the paper discusses the current challenges and potential research directions, aiming to guide future research efforts to better utilize the uniqueness and commonalities of different modalities to build a more unified and efficient multimodal learning framework.

In summary, this paper provides valuable insights for the future development of the field, both by summarizing the current technical progress in multimodal machine learning and by offering an in-depth analysis of the area.

Vision+X: A Survey on Multimodal Learning in the Light of Data

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Multimodal Machine Learning: A Survey and Taxonomy

Deep Multimodal Data Fusion

Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review

A Survey of Multimodal Large Language Model from A Data-centric Perspective

Multimodal Image Synthesis and Editing: A Survey and Taxonomy

A Survey on Image-text Multimodal Models

A Survey of Multimodal Composite Editing and Retrieval

Multimodal research in vision and language: A review of current and emerging trends

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Multimodal Image Synthesis and Editing: The Generative AI Era

Recent Advances and Trends in Multimodal Deep Learning: A Review

Multimodal Alignment and Fusion: A Survey

Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications