Abstract: Recent technological advancements have enhanced our ability to collect and analyze rich multimodal data (e.g., speech, video, and eye gaze) to better inform learning and training experiences. While previous reviews have focused on parts of the multimodal pipeline (e.g., conceptual models and data fusion), a comprehensive literature review on the methods informing multimodal learning and training environments has not been conducted. This literature review provides an in-depth analysis of research methods in these environments, proposing a taxonomy and framework that encapsulates recent methodological advances in this field and characterizes the multimodal domain in terms of five modality groups: Natural Language, Video, Sensors, Human-Centered, and Environment Logs. We introduce a novel data fusion category -- mid fusion -- and a graph-based technique for refining literature reviews, termed citation graph pruning. Our analysis reveals that leveraging multiple modalities offers a more holistic understanding of the behaviors and outcomes of learners and trainees. Even when multimodality does not enhance predictive accuracy, it often uncovers patterns that contextualize and elucidate unimodal data, revealing subtleties that a single modality may miss. However, there remains a need for further research to bridge the divide between multimodal learning and training studies and foundational AI research.

What problem does this paper attempt to address?

The paper asks how multimodal methods can be used to analyze learners' behaviors and performance in learning and training environments more comprehensively, so that learners receive more meaningful support and achieve better learning and training outcomes. Specifically, the paper focuses on the following questions:

1. **Data Types**: What kinds of data are necessary for understanding learners' behaviors and performance? How can these data support meaningful educational interventions?
2. **Multimodal Data Analysis**: How can data from different modalities (natural language, video, sensors, human-centered data, and environment logs) be collected, fused, and analyzed effectively to obtain a more complete picture of learners' behaviors and outcomes?
3. **Data Fusion Methods**: What are the advantages and disadvantages of existing data fusion methods (early fusion, late fusion, hybrid fusion) in multimodal learning and training environments? Is a new fusion category needed to better reflect current research?
4. **Research Methods**: What research methods are currently used in multimodal learning and training environments? What challenges and shortcomings do these methods have in data collection, analysis, and interpretation?
5. **Future Research Directions**: How can the gap between multimodal learning and training research and foundational AI research be bridged? What should future research focus on?

Through a systematic review of the existing literature, the paper proposes a comprehensive framework and taxonomy intended to guide and support research on multimodal learning and training environments. Specific contributions include:

- Proposing a new data fusion category, mid fusion, which sits between early fusion and late fusion and applies to partially processed features.
- Introducing citation graph pruning, a citation-graph-based screening method for programmatically filtering the corpus of a literature review.
- Providing a detailed taxonomy covering five modality groups (natural language, video, sensors, human-centered data, and environment logs) and analyzing the characteristics and applications of each group.
- Outlining the main research methods used in multimodal learning and training environments, including classification, regression, clustering, qualitative analysis, statistical analysis, network analysis, and pattern extraction.

Through these contributions, the paper aims to give researchers in multimodal learning and training a comprehensive reference framework and to promote further development of the field.
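The difference between the fusion categories named above can be illustrated with a small sketch. It is not taken from the paper: the function names, the toy feature extractors, the threshold "models", and the sample values are all hypothetical, and mid fusion is interpreted here as "partially process each modality, then combine", which is only one reading of the summary's description.

```python
from statistics import mean

# Hypothetical toy sample: one raw feature vector per modality.
sample = {"gaze": [2.0, 4.0, 6.0], "speech": [1.0, 3.0]}

# Hypothetical per-modality feature extractors (partial processing).
extractors = {
    "gaze": lambda v: [mean(v), max(v) - min(v)],  # mean and range
    "speech": lambda v: [mean(v)],                 # mean only
}

# Hypothetical per-modality "models": simple threshold scorers.
models = {
    "gaze": lambda v: 1.0 if mean(v) > 3.0 else 0.0,
    "speech": lambda v: 1.0 if mean(v) > 2.5 else 0.0,
}


def early_fusion(modalities):
    """Concatenate raw feature vectors before any modeling."""
    return [x for vec in modalities.values() for x in vec]


def mid_fusion(modalities, extractors):
    """Partially process each modality, then concatenate the derived features."""
    fused = []
    for name, vec in modalities.items():
        fused.extend(extractors[name](vec))
    return fused


def late_fusion(modalities, models):
    """Score each modality separately, then average the per-modality scores."""
    return mean(models[name](vec) for name, vec in modalities.items())


early = early_fusion(sample)           # raw values, one joint vector
mid = mid_fusion(sample, extractors)   # derived features, one joint vector
late = late_fusion(sample, models)     # one combined decision score
```

In this sketch, early and mid fusion both hand a single joint vector to a downstream model, while late fusion combines decisions that were already made per modality. Hybrid fusion, which the summary also lists, would mix these strategies and is not shown.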

Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Recent Advances and Trends in Multimodal Deep Learning: A Review

Multimodal Data Fusion in Learning Analytics: A Systematic Review

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Multimodal data indicators for capturing cognitive, motivational, and emotional learning processes: A systematic literature review

Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

A Systematic Literature Review on Multimodal Machine Learning: Applications, Challenges, Gaps and Future Directions

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Vision+X: A Survey on Multimodal Learning in the Light of Data

Deep Multimodal Data Fusion

Multimodality in meta-learning: A comprehensive survey

Multimodal Machine Learning: A Survey and Taxonomy

A Review on Methods and Applications in Multimodal Deep Learning

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

A review on data fusion in multimodal learning analytics and educational data mining

Recent Advances of Multimodal Continual Learning: A Comprehensive Survey

Literature Review on Co-Located Collaboration Modeling Using Multimodal Learning Analytics—Can We Go the Whole Nine Yards?

Multimodal Image Synthesis and Editing: A Survey and Taxonomy