Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities

Asif Iqbal Middya, Baibhav Nag, Sarbani Roy
DOI: https://doi.org/10.1016/j.knosys.2022.108580
2022-05-01
Abstract: Emotion identification based on multimodal data (e.g., audio, video, and text) is a demanding and important research area with a wide range of applications. In this context, this work conducts a rigorous exploration of model-level fusion to find the optimal multimodal model for emotion recognition using the audio and video modalities. More specifically, separate novel feature extractor networks are proposed for audio and video data. An optimal multimodal emotion recognition model is then created by fusing the audio and video features at the model level. The performance of the proposed models is assessed on two benchmark multimodal datasets, namely the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and Surrey Audio-Visual Expressed Emotion (SAVEE), using various performance metrics. The proposed models achieve high predictive accuracies of 99% and 86% on the SAVEE and RAVDESS datasets, respectively. The effectiveness of the models is also verified by comparing their performance with that of existing emotion recognition models. Some case studies are also conducted to explore the model's ability to capture the variability of speakers' emotional states in publicly available real-world audio-visual media.
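The general idea of model-level fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' architecture: the layer sizes, the random weights standing in for trained networks, the four-label emotion subset, and all function names (`make_layer`, `dense`, `fuse_and_classify`) are assumptions chosen only for illustration. Each modality passes through its own feature extractor, the two embeddings are concatenated, and a shared classifier head produces emotion probabilities.

```python
import math
import random

# Illustrative label subset (assumption), not the full label sets of RAVDESS or SAVEE.
EMOTIONS = ["angry", "happy", "sad", "neutral"]

def make_layer(n_in, n_out, rng):
    """Random weight matrix standing in for a trained layer (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def dense(weights, x, relu=True):
    """Bias-free linear layer, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, o) for o in out] if relu else out

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(audio_x, video_x, audio_net, video_net, head):
    """Model-level fusion: embed each modality separately, concatenate, classify."""
    audio_emb = dense(audio_net, audio_x)   # audio feature extractor (stand-in)
    video_emb = dense(video_net, video_x)   # video feature extractor (stand-in)
    fused = audio_emb + video_emb           # list concatenation of the two embeddings
    return softmax(dense(head, fused, relu=False))

rng = random.Random(0)
audio_net = make_layer(40, 8, rng)               # hypothetical: 40 audio features -> 8-d embedding
video_net = make_layer(64, 8, rng)               # hypothetical: 64 video features -> 8-d embedding
head = make_layer(16, len(EMOTIONS), rng)        # classifier over the 16-d fused vector

audio_x = [rng.random() for _ in range(40)]      # synthetic input, not real data
video_x = [rng.random() for _ in range(64)]      # synthetic input, not real data
probs = fuse_and_classify(audio_x, video_x, audio_net, video_net, head)
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

In a trained system the two extractors and the head would be learned jointly or in stages; the sketch only shows where the fusion happens, namely between the modality-specific embeddings and the shared classifier.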
computer science, artificial intelligence