Abstract: “Big data” is typically collected from diverse sources that have different data structures. With the rapid development of information technologies, today's valuable data resources are characteristically multi-modal. As a result, multi-modal learning, which builds on classical machine learning strategies, has become a valuable research topic that enables computers to process and understand “big data”. The cognitive processes of humans involve perception through different sense organs. Signals from the eyes, ears, nose, and hands (tactile sense) together constitute a person’s understanding of a specific scene or of the world as a whole. It is reasonable to believe that multi-modal methods with a greater ability to process complex heterogeneous data can further promote the progress of information technologies. The concept of multimodality originated in psychology and pedagogy hundreds of years ago and has become popular in computer science during the past decade. In contrast to the concept of “media”, a “mode” is a more fine-grained concept that is associated with a typical data source or data form. The effective utilization of multi-modal data can help a computer understand a specific environment in a more holistic way. In this context, we first introduced the definition and main tasks of multi-modal learning. Based on this information, the mechanism and origin of multi-modal machine learning were then briefly introduced. Subsequently, statistical learning methods and deep learning methods for multi-modal tasks were comprehensively summarized. We also introduced the main styles of data fusion in multi-modal perception tasks, including feature representation, shared mapping, and co-training. Additionally, novel adversarial learning strategies for cross-modal matching or generation were reviewed. The main methods for multi-modal learning were outlined in this paper with a focus on future research issues in this field.

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Multimodal Machine Learning: A Survey and Taxonomy

Foundations of Multisensory Artificial Intelligence

A Survey of Multimodal Machine Learning

A Theory of Multimodal Learning

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

Vision+X: A Survey on Multimodal Learning in the Light of Data

Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review

Modality Influence in Multimodal Machine Learning

Recent Advances and Trends in Multimodal Deep Learning: A Review

What is Multimodality?

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

A Systematic Literature Review on Multimodal Machine Learning: Applications, Challenges, Gaps and Future Directions

New Ideas and Trends in Deep Multimodal Content Understanding: A Review

Multimodal Systems: Taxonomy, Methods, and Challenges

Continual Learning Meets Multimodal Foundation Models: Fundamentals and Advances

Recent Advances of Multimodal Continual Learning: A Comprehensive Survey