Abstract: “Big data” is typically collected from different sources that have different data structures. With the rapid development of information technologies, today’s valuable data resources are characteristically multimodal. As a result, multi-modal learning, which builds on classical machine learning strategies, has become a valuable research topic that enables computers to process and understand “big data”. Human cognition involves perception through different sense organs: signals from the eyes, ears, nose, and hands (tactile sense) together constitute a person’s understanding of a specific scene or of the world as a whole. It is reasonable to believe that multi-modal methods with a greater ability to process complex heterogeneous data can further promote the progress of information technologies. The concept of multimodality stems from psychology and pedagogy, where it dates back hundreds of years, and it has become popular in computer science during the past decade. In contrast to the concept of “media”, a “mode” is a more fine-grained concept that is associated with a typical data source or data form. The effective utilization of multi-modal data can help a computer understand a specific environment in a more holistic way. In this paper, we first introduce the definition and main tasks of multi-modal learning. We then briefly introduce the mechanism and origin of multi-modal machine learning. Subsequently, we comprehensively summarize statistical learning methods and deep learning methods for multi-modal tasks. We also introduce the main styles of data fusion in multi-modal perception tasks, including feature representation, shared mapping, and co-training. Additionally, we review novel adversarial learning strategies for cross-modal matching or generation. The paper outlines the main methods for multi-modal learning, with a focus on future research issues in this field.

What is Multimodality?

Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Multimodal Machine Learning: A Survey and Taxonomy

A Theory of Multimodal Learning

A Survey of Multimodal Machine Learning

Vision+X: A Survey on Multimodal Learning in the Light of Data

Multimodal Grounding for Language Processing

Modality Influence in Multimodal Machine Learning

Multimodal English

Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

A Systematic Literature Review on Multimodal Machine Learning: Applications, Challenges, Gaps and Future Directions

Language as the Medium: Multimodal Video Classification through text only

Multimodal research in vision and language: A review of current and emerging trends

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Towards Multimodal Content Representation

Multimodal interaction: A review