Multimodal modelling of human emotion using sound, image and text fusion

Seyed Sadegh Hosseini, Mohammad Reza Yamaghani, Soodabeh Poorzaker Arabani
DOI: https://doi.org/10.1007/s11760-023-02707-8
2023-08-11
Abstract: Multimodal emotion recognition and analysis is an evolving field of research. Improving the multimodal fusion mechanism plays an important role in recognising emotion in finer detail. To optimise the performance of the emotion recognition system, a model for multimodal emotion recognition from audio, text and video data was proposed. The data were first fused pairwise, as a combination of video and audio and as a combination of audio and text, and the results of the two pairings were then fused together. The final output therefore drew on audio, text and video data, taking into account common features. In the audio-video branch, a convolutional neural network combined with long short-term memory (CNN-LSTM) was used to extract audio features, and the Inception-ResNet-v2 network was used to extract facial expression features from the video. The fused features were passed through an LSTM and used as the input of a Softmax classifier to recognise emotion from the fused audio and video features. In the audio-text branch, the CNN-LSTM was combined in the form of a binary channel for learning audio emotion features, a Bi-LSTM network was used to extract the text features, and Softmax was used to classify the fused features. Finally, the results of the two branches were fused for the final classification using a logistic regression model. The recognition accuracy of the proposed method on the IEMOCAP dataset was 82.9%.
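The final step described in the abstract, fusing the Softmax outputs of the audio-video and audio-text branches with logistic regression, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the four-emotion label set, the synthetic branch outputs, the training hyperparameters, and all function names (`fuse_features`, `train_fusion`, `fake_branch_output`, and so on) are hypothetical. The deep feature extractors (CNN-LSTM, Inception-ResNet-v2, Bi-LSTM) are replaced by random stand-ins so that the sketch runs with the Python standard library alone.

```python
import math
import random

# Assumed label set, for illustration only; the abstract does not list the classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(p_audio_video, p_audio_text):
    """Concatenate the two branch probability vectors into one fusion input."""
    return list(p_audio_video) + list(p_audio_text)


def class_scores(W, b, x):
    """Linear score per class: W[k] . x + b[k]."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def train_fusion(X, y, n_classes, lr=0.5, epochs=200):
    """Fit multinomial logistic regression by full-batch gradient descent."""
    n, d = len(X), len(X[0])
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[0.0] * d for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, target in zip(X, y):
            probs = softmax(class_scores(W, b, x))
            for k in range(n_classes):
                # Gradient of cross-entropy w.r.t. the class-k score.
                err = probs[k] - (1.0 if k == target else 0.0)
                grad_b[k] += err
                for j in range(d):
                    grad_W[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * grad_b[k] / n
            for j in range(d):
                W[k][j] -= lr * grad_W[k][j] / n
    return W, b


def predict(W, b, x):
    """Index of the highest-scoring class for one fused input."""
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


def fake_branch_output(label, rng, n_classes, boost=3.0):
    """Hypothetical stand-in for a trained branch: noisy softmax peaked near `label`."""
    logits = [rng.gauss(0.0, 1.0) + (boost if k == label else 0.0)
              for k in range(n_classes)]
    return softmax(logits)


rng = random.Random(0)
n_classes = len(EMOTIONS)
labels = [rng.randrange(n_classes) for _ in range(200)]
X = [fuse_features(fake_branch_output(t, rng, n_classes),   # audio-video branch
                   fake_branch_output(t, rng, n_classes))   # audio-text branch
     for t in labels]

W, b = train_fusion(X, labels, n_classes)
accuracy = sum(predict(W, b, x) == t for x, t in zip(X, labels)) / len(labels)
print(f"training accuracy of the fusion layer on synthetic data: {accuracy:.2f}")
```

Because the inputs here are synthetic, the printed accuracy says nothing about IEMOCAP; the sketch only shows how a logistic regression layer can combine two per-branch probability vectors into a single decision.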
Engineering, Electrical & Electronic; Imaging Science & Photographic Technology