Abstract: “Big data” is typically collected from different sources that have different data structures. With the rapid development of information technologies, current valuable data resources are increasingly multi-modal. As a result, multi-modal learning, which builds on classical machine learning strategies, has become a valuable research topic that enables computers to process and understand “big data”. Human cognition involves perception through different sense organs: signals from the eyes, ears, nose, and hands (tactile sense) together constitute a person’s understanding of a specific scene or of the world as a whole. It is therefore reasonable to believe that multi-modal methods, with their greater ability to process complex heterogeneous data, can further promote the progress of information technologies. The concept of multimodality originated in psychology and pedagogy hundreds of years ago and has become popular in computer science during the past decade. In contrast to “media”, a “mode” is a more fine-grained concept that is associated with a typical data source or data form. The effective utilization of multi-modal data can help a computer understand a specific environment in a more holistic way. In this paper, we first introduce the definition and main tasks of multi-modal learning. On this basis, we then briefly introduce the mechanism and origin of multi-modal machine learning. Subsequently, we comprehensively summarize statistical learning methods and deep learning methods for multi-modal tasks. We also introduce the main styles of data fusion in multi-modal perception tasks, including feature representation, shared mapping, and co-training. Additionally, we review novel adversarial learning strategies for cross-modal matching or generation. The paper outlines the main methods for multi-modal learning, with a focus on future research issues in this field.
