Student's Emotion Recognition using Multimodality and Deep Learning

M. Kalaiyarasi, B. V. V. Siva Prasad, Janjhyam Venkata Naga Ramesh, Ravindra Kumar Kushwaha, Ruchi Patel, Balajee J
DOI: https://doi.org/10.1145/3654797
IF: 1.471
2024-04-01
ACM Transactions on Asian and Low-Resource Language Information Processing
Abstract: The goal of emotion detection is to find and recognise emotions in text, speech, gestures, facial expressions, and more. This paper proposes an effective multimodal emotion recognition system based on facial expressions, sentence-level text, and voice. Using public datasets, we examine facial expression image classification and feature extraction. Tri-modal fusion is used to integrate the findings and produce the final emotion. The proposed method has been verified with classroom students, and the feelings correlate with their performance. This method categorizes students' expressions into seven emotions: happy, surprise, sad, fear, disgust, anger, and contempt. Compared to the unimodal models, the suggested multimodal network design may reach up to 65% accuracy. The proposed method can detect negative feelings such as boredom or loss of interest in the learning environment.
computer science, artificial intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The main goal of this paper is to develop a multimodal emotion recognition system based on facial expressions, sentence-level text, and speech. Specifically, the paper pursues this goal in the following ways:

1. **Multimodal Fusion**: Combining data from three modalities (facial expressions, text, and speech) using tri-modal fusion to integrate information and improve the accuracy of emotion recognition.
2. **Classroom Application**: Testing the method in a classroom environment and showing that students' emotional states correlate with their performance.
3. **Emotion Classification**: Classifying students' emotions into seven categories: happiness, surprise, sadness, fear, disgust, anger, and contempt.
4. **Performance Evaluation**: Compared with unimodal models, the proposed multimodal network design can reach an accuracy of up to 65%.

The paper examines facial expression image classification and feature extraction on public datasets and demonstrates the method's application in real classroom environments. Additionally, the method can detect negative emotions in the learning environment, such as boredom or loss of interest.
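
To make the tri-modal fusion idea concrete, here is a minimal sketch of one common approach: decision-level (late) fusion by weighted averaging. The abstract does not say which fusion rule the authors use, so the `fuse` function, the equal default weights, and the example scores are all assumptions made for illustration. They are not the paper's method or results.

```python
# Hypothetical sketch of decision-level (late) fusion. The abstract does not
# specify the fusion rule, the weights, or the model outputs used here.

EMOTIONS = ["happy", "surprise", "sad", "fear", "disgust", "anger", "contempt"]


def fuse(face, text, voice, weights=(1.0, 1.0, 1.0)):
    """Return the top emotion and the weighted average of three score vectors."""
    modalities = (face, text, voice)
    if any(len(p) != len(EMOTIONS) for p in modalities):
        raise ValueError("each modality needs one score per emotion")
    total = sum(weights)
    # Weighted mean of the three modalities, computed per emotion class.
    fused = [
        sum(w * p[i] for w, p in zip(weights, modalities)) / total
        for i in range(len(EMOTIONS))
    ]
    best = max(range(len(EMOTIONS)), key=fused.__getitem__)
    return EMOTIONS[best], fused


# Made-up per-modality probabilities, for illustration only.
face = [0.50, 0.10, 0.20, 0.05, 0.05, 0.05, 0.05]
text = [0.20, 0.10, 0.50, 0.05, 0.05, 0.05, 0.05]
voice = [0.20, 0.05, 0.55, 0.05, 0.05, 0.05, 0.05]

label, scores = fuse(face, text, voice)
print(label)  # sad
```

In this toy input the face model alone would predict "happy", but the text and voice scores outweigh it and the fused prediction is "sad". Raising the face weight, for example `weights=(5, 1, 1)`, changes the fused result back to "happy", which shows how the choice of weights shapes the final decision.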