A Multimodal Fusion Network For Student Emotion Recognition Based on Transformer and Tensor Product

Ao Xiang, Zongqing Qi, Han Wang, Qin Yang, Danqing Ma
2024-10-23
Abstract: This paper introduces a new multi-modal model based on the Transformer architecture and tensor product fusion strategy, combining BERT's text vectors and ViT's image vectors to classify students' psychological conditions, with an accuracy of 93.65%. The purpose of the study is to accurately analyze the mental health status of students from various data sources. This paper discusses modal fusion methods, including early, late and intermediate fusion, to overcome the challenges of integrating multi-modal information. Ablation studies compare the performance of different models and fusion techniques, showing that the proposed model outperforms existing methods such as CLIP and ViLBERT in terms of accuracy and inference speed. Conclusions indicate that while this model has significant advantages in emotion recognition, its potential to incorporate other data modalities provides areas for future research.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to address is the accurate identification of students' psychological and emotional states through multimodal fusion technology. Specifically, the researchers aim to extract information from various data sources such as text and images, and combine the Transformer architecture with tensor product fusion strategies to build a model that can efficiently and accurately analyze students' mental health status. Solving this problem is of great significance for improving students' overall well-being, academic performance, and physical and mental development.

### Background and Motivation

- **Limitations of Traditional Methods**: Traditional mental health assessment methods (such as interviews with psychologists) are inefficient and cannot promptly detect students' abnormal psychological states.
- **Opportunities in the Big Data Era**: With the development of educational informatization, a large amount of educational big data has been accumulated, making it possible to use this data for mental health assessment.
- **Importance of Multimodal Data**: Single-modal data (such as questionnaires) may not be sufficient to fully understand students' psychological states, so it is necessary to combine multiple data sources (such as text and images) for comprehensive analysis.

### Research Objectives

- **Construct a Multimodal Model**: Based on the Transformer architecture and tensor product fusion strategy, combining BERT's text vectors and ViT's image vectors, to construct an efficient and accurate multimodal fusion model.
- **Improve Recognition Accuracy**: Through experimental validation, the model achieved an accuracy rate of 93.65% in the task of student emotion recognition.
- **Explore Different Fusion Methods**: Discussed different multimodal fusion methods such as early fusion, late fusion, and intermediate fusion to find the optimal fusion strategy.
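
As an illustration of the tensor product fusion idea named in the objectives, the following minimal Python sketch contrasts feature concatenation (a common form of early fusion) with a tensor (outer) product of a text vector and an image vector. It is not the authors' implementation: the function names, the vector sizes 8 and 6, and the random vectors standing in for BERT and ViT embeddings are all assumptions made for the example.

```python
import numpy as np


def concat_fusion(text_vec, image_vec):
    """Concatenate the two modality vectors (a simple early-fusion baseline)."""
    return np.concatenate([text_vec, image_vec])


def tensor_product_fusion(text_vec, image_vec):
    """Fuse two modality vectors via their outer (tensor) product, flattened.

    Entry (i, j) of the outer product is text_vec[i] * image_vec[j], so every
    text feature is paired with every image feature.
    """
    return np.outer(text_vec, image_vec).ravel()


# Hypothetical stand-ins for (projected) BERT and ViT embeddings.
rng = np.random.default_rng(0)
text_vec = rng.standard_normal(8)
image_vec = rng.standard_normal(6)

early = concat_fusion(text_vec, image_vec)
fused = tensor_product_fusion(text_vec, image_vec)

print(early.shape)  # (14,)  -> 8 + 6 features
print(fused.shape)  # (48,)  -> 8 * 6 pairwise interaction features
```

The outer product yields one feature per (text dimension, image dimension) pair, which is how it can expose cross-modal interactions that concatenation leaves to later layers. The cost is that the fused size grows as the product of the input sizes rather than their sum, so in practice the inputs would likely be projected to a lower dimension first.
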
### Main Contributions

- **Innovative Fusion Strategy**: Proposed a tensor product-based multimodal fusion method that can capture deep interaction information between different modalities.
- **Superior Performance**: Comparative experiments demonstrated that the proposed model outperforms existing mainstream models (such as CLIP and ViLBERT) in terms of accuracy and inference speed.
- **Practical Application Potential**: The model can be used for student mental health monitoring, timely detection of potential psychological problems, and providing decision support for educators.

### Future Research Directions

- **Expand Data Modalities**: Future research can consider incorporating more data modalities, such as audio and quantitative features, to further improve the model's performance.
- **Lightweight and High Performance**: Optimize the model structure to make it more lightweight and suitable for resource-limited scenarios.

In summary, this paper addresses key issues in student mental health assessment by constructing an efficient multimodal fusion model, providing strong support for practical applications in the field of education.