Multi-Modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers

Minoo Shayaninasab, Bagher Babaali
2024-02-12
Abstract: Due to the complex nature of human emotions and the diversity of emotion representation methods in humans, emotion recognition is a challenging field. In this research, three input modalities, namely text, audio (speech), and video, are employed to generate multimodal feature vectors. For generating features for each of these modalities, pre-trained Transformer models with fine-tuning are utilized. In each modality, a Transformer model is used with transfer learning to extract features and emotional structure. These features are then fused together, and emotion recognition is performed using a classifier. To select an appropriate fusion method and classifier, various feature-level and decision-level fusion techniques have been experimented with. The best model, which combines feature-level fusion by concatenating feature vectors with classification using a Support Vector Machine, achieves an accuracy of 75.42% on the IEMOCAP multimodal dataset.
Keywords: Multimodal Emotion Recognition, IEMOCAP, Self-Supervised Learning, Transfer Learning, Transformer.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses multimodal emotion recognition, building multimodal feature vectors from three input modalities: text, speech, and video. For each modality, the authors fine-tune a pre-trained Transformer model to extract features and emotional structure. The per-modality features are then fused and passed to a classifier that predicts the emotion. To choose the fusion method and classifier, the paper experiments with several feature-level and decision-level fusion techniques. The best configuration, which concatenates the feature vectors and classifies them with a Support Vector Machine (SVM), reaches an accuracy of 75.42% on the IEMOCAP multimodal dataset.

The paper notes that although most emotion recognition work focuses on single-modality models, an ideal system should be multimodal, simulating the human sensory system. It compares different combinations of modalities, such as facial expressions and speech, and discusses the challenges of multimodal learning: modality selection, handling of missing data, synchronization, and integration of the different modalities.

In the related work, the paper covers progress in using deep learning and Transformer models to fuse audio and visual modalities. It also compares fusion methods, namely feature-level fusion and decision-level fusion, and finds that feature-level fusion performs better in certain cases.

In the experiments, the paper fine-tunes the pre-trained Transformer models on the IEMOCAP dataset, using BERT for text, wav2vec 2.0 for speech, and VideoMAE for video. It evaluates both early fusion (feature-level) and late fusion (decision-level), with SVM, XGBoost, and a two-layer neural network as classifiers. The configuration that performs best is early fusion by concatenating the feature vectors, combined with an SVM classifier.
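
The difference between the two fusion families can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature values, the four-class label set, and the function names are hypothetical, and averaging class probabilities is only one possible decision-level rule (the summary does not say which rules the paper tried). In the paper's best setup, the concatenated vector would be the input to an SVM; the classifier itself is left out here.

```python
from typing import Dict, List, Sequence

# Hypothetical label set for illustration; the summary does not list the classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def early_fusion(features: Dict[str, Sequence[float]],
                 order: Sequence[str]) -> List[float]:
    """Feature-level fusion: concatenate per-modality vectors in a fixed order."""
    fused: List[float] = []
    for modality in order:
        fused.extend(features[modality])
    return fused


def late_fusion(probabilities: Dict[str, Sequence[float]]) -> List[float]:
    """Decision-level fusion (assumed rule): average per-modality class probabilities."""
    n_modalities = len(probabilities)
    return [
        sum(p[i] for p in probabilities.values()) / n_modalities
        for i in range(len(EMOTIONS))
    ]


def predict(probs: Sequence[float]) -> str:
    """Return the label with the highest fused probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return EMOTIONS[best]


# Toy per-modality feature vectors (real embeddings would be much longer).
features = {"text": [0.1, 0.2], "speech": [0.3], "video": [0.4, 0.5, 0.6]}
fused_vector = early_fusion(features, ["text", "speech", "video"])

# Toy per-modality class probabilities, as if from three separate classifiers.
modality_probs = {
    "text": [0.1, 0.6, 0.2, 0.1],
    "speech": [0.2, 0.3, 0.4, 0.1],
    "video": [0.1, 0.5, 0.3, 0.1],
}
fused_probs = late_fusion(modality_probs)
fused_label = predict(fused_probs)
```

With early fusion, a single classifier sees all modalities at once and can learn interactions between them. With late fusion, each modality is classified separately and only the decisions are combined.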