Abstract: Emotion recognition is a challenging field because human emotions are complex and people express them in many different ways. In this research, three input modalities, namely text, audio (speech), and video, are employed to generate multimodal feature vectors. For each modality, a pre-trained Transformer model is fine-tuned through transfer learning to extract features and emotional structure. These features are then fused, and emotion recognition is performed using a classifier. To select an appropriate fusion method and classifier, various feature-level and decision-level fusion techniques were tested. The best model, which combines feature-level fusion by concatenating feature vectors with classification by a Support Vector Machine, achieves an accuracy of 75.42% on the IEMOCAP multimodal dataset.

Keywords: Multimodal Emotion Recognition, IEMOCAP, Self-Supervised Learning, Transfer Learning, Transformer.

What problem does this paper attempt to address?

The paper addresses multimodal emotion recognition, building multimodal feature vectors from three input modalities: text, speech, and video. The authors fine-tune a pre-trained Transformer model for each modality to extract features and emotional structure. These features are then fused and passed to a classifier that predicts the emotion. To choose the fusion method and classifier, the paper tries various feature-level and decision-level fusion techniques, and finds that concatenating the feature vectors and classifying with a Support Vector Machine (SVM) achieves an accuracy of 75.42% on the IEMOCAP multimodal dataset.

The research points out that although most emotion recognition work focuses on single-modality models, the ideal system should be multimodal, simulating the human sensory system. The paper compares different combinations of modalities, such as facial expressions and speech, and discusses the challenges of multimodal learning, such as modality selection, handling of missing data, synchronization, and integration of different modalities.

In related work, the paper mentions progress in using deep learning and Transformer models to fuse audio and visual modalities. It also compares fusion methods, namely feature-level fusion and decision-level fusion, and finds that feature-level fusion performs better in certain cases.

In the experiment section, the paper fine-tunes pre-trained Transformer models on the IEMOCAP dataset, selecting BERT, wav2vec2.0, and videoMAE for the text, speech, and video modalities respectively. It evaluates early fusion (feature-level fusion) and late fusion (decision-level fusion) with SVM, XGBoost, and a two-layer neural network as classifiers, and concludes that early fusion by concatenating feature vectors, combined with an SVM classifier, performs best, reaching the 75.42% accuracy reported above.
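The difference between early (feature-level) and late (decision-level) fusion can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny hand-written vectors stand in for the embeddings that fine-tuned BERT, wav2vec2.0, and videoMAE would produce, the labels and function names are hypothetical, and a nearest-centroid classifier is swapped in for the paper's SVM so that the example runs with the Python standard library alone.

```python
import math

def concat_features(text_vec, audio_vec, video_vec):
    """Feature-level (early) fusion: join the three vectors end to end."""
    return list(text_vec) + list(audio_vec) + list(video_vec)

def fit_centroids(vectors, labels):
    """Stand-in classifier (NOT the paper's SVM): one mean vector per class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def class_scores(centroids, vec):
    """Softmax over negative Euclidean distances to each class centroid."""
    neg = {label: -math.dist(c, vec) for label, c in centroids.items()}
    top = max(neg.values())
    exp = {label: math.exp(v - top) for label, v in neg.items()}
    total = sum(exp.values())
    return {label: e / total for label, e in exp.items()}

def predict(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)

def late_fusion(per_modality_scores):
    """Decision-level (late) fusion: average the per-modality class scores."""
    labels = per_modality_scores[0].keys()
    n = len(per_modality_scores)
    return {label: sum(s[label] for s in per_modality_scores) / n for label in labels}

# Hypothetical toy data: (text, audio, video, label) per utterance.
train = [
    ([0.9, 0.1], [0.8, 0.2, 0.1], [0.7, 0.3], "happy"),
    ([0.8, 0.2], [0.9, 0.1, 0.2], [0.6, 0.2], "happy"),
    ([0.1, 0.9], [0.2, 0.8, 0.9], [0.2, 0.8], "sad"),
    ([0.2, 0.8], [0.1, 0.9, 0.8], [0.3, 0.9], "sad"),
]
query = ([0.85, 0.15], [0.7, 0.3, 0.2], [0.6, 0.3])
labels = [row[3] for row in train]

# Early fusion: one classifier trained on the concatenated vectors.
fused_train = [concat_features(t, a, v) for t, a, v, _ in train]
early_model = fit_centroids(fused_train, labels)
early_pred = predict(class_scores(early_model, concat_features(*query)))

# Late fusion: one classifier per modality, then combine their scores.
per_modality = []
for m in range(3):
    model = fit_centroids([row[m] for row in train], labels)
    per_modality.append(class_scores(model, query[m]))
late_pred = predict(late_fusion(per_modality))

print(early_pred, late_pred)
```

In early fusion a single classifier sees all modalities at once and can exploit interactions between them, whereas in late fusion each modality is classified independently and only the decisions are merged. The sketch shows the two data flows only; it says nothing about which performs better on real data.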

Multi-Modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers

Emotion Recognition in Videos via Fusing Multimodal Features

Multilevel Transformer For Multimodal Emotion Recognition

Multimodal Speech Emotion Recognition Using Modality-specific Self-Supervised Frameworks

Multi-head attention fusion networks for multi-modal speech emotion recognition

Facial Emotion Recognition with Inter-Modality-Attention-Transformer-Based Self-Supervised Learning

Transformer Based Multimodal Speech Emotion Recognition with Improved Neural Networks

Emotion Recognition with Pre-Trained Transformers Using Multimodal Signals

TMFER: Multimodal Fusion Emotion Recognition Algorithm Based on Transformer

Multimodal Transformer Fusion for Emotion Recognition: A Survey

Multi-Label Multimodal Emotion Recognition With Transformer-Based Fusion and Emotion-Level Representation Learning

Multimodal modelling of human emotion using sound, image and text fusion

Bi-Modal Bi-Task Emotion Recognition Based on Transformer Architecture

Multimodal transformer augmented fusion for speech emotion recognition

Multimodal Transformer Fusion for Continuous Emotion Recognition

Multimodal Transformer with Learnable Frontend and Self Attention for Emotion Recognition

Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention

Multimodal Emotion Recognition Based on Deep Temporal Features Using Cross-Modal Transformer and Self-Attention

Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild

Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models

Multimodal Emotion Recognition Using Different Fusion Techniques