Abstract: This paper introduces a new multi-modal model based on the Transformer architecture and tensor product fusion strategy, combining BERT's text vectors and ViT's image vectors to classify students' psychological conditions, with an accuracy of 93.65%. The purpose of the study is to accurately analyze the mental health status of students from various data sources. This paper discusses modal fusion methods, including early, late and intermediate fusion, to overcome the challenges of integrating multi-modal information. Ablation studies compare the performance of different models and fusion techniques, showing that the proposed model outperforms existing methods such as CLIP and ViLBERT in terms of accuracy and inference speed. Conclusions indicate that while this model has significant advantages in emotion recognition, its potential to incorporate other data modalities provides areas for future research.

What problem does this paper attempt to address?

The problem this paper attempts to address is the accurate identification of students' psychological and emotional states through multimodal fusion technology. Specifically, the researchers aim to extract information from various data sources such as text and images, and combine the Transformer architecture with tensor product fusion strategies to build a model that can efficiently and accurately analyze students' mental health status. Solving this problem is of great significance for improving students' overall well-being, academic performance, and physical and mental development.

### Background and Motivation

- **Limitations of Traditional Methods**: Traditional mental health assessment methods (such as interviews with psychologists) are inefficient and cannot promptly detect students' abnormal psychological states.
- **Opportunities in the Big Data Era**: With the development of educational informatization, a large amount of educational big data has been accumulated, making it possible to use this data for mental health assessment.
- **Importance of Multimodal Data**: Single-modal data (such as questionnaires) may not be sufficient to fully understand students' psychological states, so it is necessary to combine multiple data sources (such as text and images) for comprehensive analysis.

### Research Objectives

- **Construct a Multimodal Model**: Based on the Transformer architecture and tensor product fusion strategy, combining BERT's text vectors and ViT's image vectors, to construct an efficient and accurate multimodal fusion model.
- **Improve Recognition Accuracy**: Through experimental validation, the model achieved an accuracy rate of 93.65% in the task of student emotion recognition.
- **Explore Different Fusion Methods**: Discussed different multimodal fusion methods such as early fusion, late fusion, and intermediate fusion to find the optimal fusion strategy.

### Main Contributions

- **Innovative Fusion Strategy**: Proposed a tensor product-based multimodal fusion method that can capture deep interaction information between different modalities.
- **Superior Performance**: Comparative experiments demonstrated that the proposed model outperforms existing mainstream models (such as CLIP and ViLBERT) in terms of accuracy and inference speed.
- **Practical Application Potential**: The model can be used for student mental health monitoring, timely detection of potential psychological problems, and providing decision support for educators.

### Future Research Directions

- **Expand Data Modalities**: Future research can consider incorporating more data modalities, such as audio and quantitative features, to further improve the model's performance.
- **Lightweight and High Performance**: Optimize the model structure to make it more lightweight and suitable for resource-limited scenarios.

In summary, this paper addresses key issues in student mental health assessment by constructing an efficient multimodal fusion model, providing strong support for practical applications in the field of education.
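
The fusion strategies named in this summary can be illustrated with a small sketch. It is not the paper's implementation: the function names, the toy vectors, and the equal late-fusion weighting are assumptions made for illustration, and plain Python lists stand in for the BERT and ViT embeddings. The sketch contrasts early fusion (concatenation), a tensor product (outer product) of the two vectors, and late fusion (averaging per-class probabilities).

```python
from typing import List, Sequence


def early_fusion(text_vec: Sequence[float], image_vec: Sequence[float]) -> List[float]:
    """Early (feature-level) fusion: concatenate the two feature vectors."""
    return list(text_vec) + list(image_vec)


def tensor_product_fusion(text_vec: Sequence[float], image_vec: Sequence[float]) -> List[float]:
    """Tensor (outer) product fusion, flattened row by row.

    Every text feature is multiplied by every image feature, so the result
    holds all pairwise cross-modal interactions.
    """
    return [t * v for t in text_vec for v in image_vec]


def late_fusion(
    text_probs: Sequence[float],
    image_probs: Sequence[float],
    text_weight: float = 0.5,  # hypothetical weighting, not taken from the paper
) -> List[float]:
    """Late (decision-level) fusion: weighted average of per-class probabilities."""
    if len(text_probs) != len(image_probs):
        raise ValueError("both classifiers must score the same set of classes")
    return [
        text_weight * p + (1.0 - text_weight) * q
        for p, q in zip(text_probs, image_probs)
    ]


# Toy stand-ins for a text embedding and an image embedding.
text_vec = [1.0, 2.0]
image_vec = [3.0, 4.0, 5.0]

print(early_fusion(text_vec, image_vec))           # 2 + 3 = 5 features
print(tensor_product_fusion(text_vec, image_vec))  # 2 * 3 = 6 features
print(late_fusion([0.75, 0.25], [0.25, 0.75]))     # averaged class scores
```

The trade-off is visible in the output sizes: concatenation grows additively with the input dimensions, while the tensor product grows multiplicatively. If both encoders produced 768-dimensional vectors (the hidden size of the base BERT and ViT variants; the summary does not say which variants the paper uses), the flattened product would have 768 × 768 = 589,824 entries, which is one reason a lighter-weight model is a natural follow-up direction.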

A Multimodal Fusion Network For Student Emotion Recognition Based on Transformer and Tensor Product

An Intra- and Inter-Emotion Transformer-Based Fusion Model with Homogeneous and Diverse Constraints Using Multi-Emotional Audiovisual Features for Depression Detection

MF-Net: a multimodal fusion network for emotion recognition based on multiple physiological signals

Multimodal transformer augmented fusion for speech emotion recognition

Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling

Emotion Recognition with Multimodal Transformer Fusion Framework Based on Acoustic and Lexical Information

Multi-head attention fusion networks for multi-modal speech emotion recognition

Multimodal Transformer Fusion for Continuous Emotion Recognition

A novel feature fusion network for multimodal emotion recognition from EEG and eye movement signals

Multi-modal fusion network with complementarity and importance for emotion recognition

TMFER: Multimodal Fusion Emotion Recognition Algorithm Based on Transformer

Emotion recognition using multimodal deep learning in multiple psychophysiological signals and video

TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

Multimodal emotion recognition model via hybrid model with improved feature level fusion on facial and EEG feature set

A Three-stage multimodal emotion recognition network based on text low-rank fusion

A multi-stage dynamical fusion network for multimodal emotion recognition

Multi-modal Emotion Recognition Based on Speech and Image

Multimodal Sentiment Analysis Based on Transformer and Low-rank Fusion

Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition

Learning Modality-Fused Representation Based on Transformer for Emotion Analysis