Abstract: In the domain of human-computer interaction, accurately recognizing and interpreting human emotions is crucial yet challenging because emotional expressions are complex and subtle. This study explores the potential for detecting a rich and flexible range of emotions through a multimodal approach that integrates facial expressions, voice tones, and transcripts from video clips. We propose a novel framework that maps a variety of emotions into a three-dimensional Valence-Arousal-Dominance (VAD) space, which reflects the fluctuations and positivity/negativity of emotions and enables a more varied and comprehensive representation of emotional states. We employed K-means clustering to transition emotions from traditional discrete categorization to a continuous labeling system and built a classifier for emotion recognition upon this system. The effectiveness of the proposed model is evaluated using the MER2024 dataset, which contains culturally consistent video clips from Chinese movies and TV series, annotated with both discrete and open-vocabulary emotion labels. Our experiment successfully achieved the transformation between discrete and continuous models, and the proposed model generated a more diverse and comprehensive emotion vocabulary while maintaining strong accuracy.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper aims to address the complexity and subtlety in human emotion recognition. Specifically, it proposes a multimodal framework that spans both discrete and continuous emotion systems to detect a rich and flexible range of emotions. Traditional emotion recognition methods typically categorize emotions into a few basic categories, which have limitations when dealing with complex emotions. By integrating facial expressions, vocal intonations, and textual content from video clips, the paper introduces a new framework that maps multiple emotions into a three-dimensional Valence-Arousal-Dominance (VAD) space. This reflects the volatility and positivity/negativity of emotions, providing a more diverse and comprehensive representation of emotional states.

### Main Contributions

1. **Multimodal System**: The paper introduces a multimodal system that maps emotions into a continuous latent space, aligning with human perception. This framework outperforms existing classifiers on closed-set datasets.
2. **VAD Scoring System**: By combining the VAD scoring system with emotion classifiers, the paper learns more nuanced representations of emotional states than those learned from discrete categories.
3. **Open-Set Emotion Classification**: The paper benchmarks the open-set emotion classification task by applying the wav2vec model. Experimental results show a high correlation between the proposed model and ground truth.

### Method Overview

1. **K-means Clustering Classifier**: Used to convert between discrete and continuous emotion labels. By extracting 195 emotion words and their corresponding VAD scores, K-means clustering maps continuous VAD scores back to discrete emotion categories.
2. **ME2E Baseline Model**: Processes data from video, audio, and text modalities, extracting facial features, audio spectrograms, and textual data. These are analyzed through convolutional layers and Transformers, ultimately generating emotion predictions through a weighted fusion mechanism.
3. **ME2E Lite Improved Model**: Simplifies the original model architecture, reducing the number of parameters, making it more suitable for small-scale datasets.
4. **VAD Model**: Uses VAD scores as input, trained with a Mean Squared Error (MSE) loss function to independently predict the distribution of each VAD dimension. The predicted VAD scores are then mapped back to the K-means clustering classifier, achieving the conversion from continuous to discrete emotion labels.

### Experimental Setup and Evaluation

- **Dataset**: Uses the MER2024 dataset, which includes culturally consistent video clips from Chinese movies and TV dramas, annotated with six discrete emotions and open vocabulary emotion labels.
- **Model Training**: Uses the SGD optimizer, adds batch normalization and dropout layers, and dynamically adjusts the learning rate. Trained for 30 epochs on a single NVIDIA Tesla V40 GPU.
- **Performance Evaluation**: Evaluates continuous emotion detection tasks using L2 distance, MSE, MAE, and Pearson Correlation Coefficient (PCC). Evaluates discrete emotion detection tasks using F1 score, precision, and recall.

### Results Analysis

- **Continuous Emotion Detection**: The proposed VAD model performs well in terms of L2 distance, MSE, and MAE, with a PCC value of 0.47, indicating effective emotion value prediction.
- **Discrete Emotion Detection**: Compared to the ME2E and ME2E Lite models, the proposed VAD model performs better in terms of precision and recall, with an F1 score of 0.42.
- **Open Vocabulary Exploration**: By generating open vocabulary emotional responses, the model's effectiveness in capturing nuanced emotional states is validated.
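
The continuous-task metrics named in the evaluation (L2 distance, MSE, MAE, PCC) can be sketched in plain Python as follows. This is a minimal illustration, not the paper's evaluation code. The function names `pearson` and `vad_metrics` are hypothetical, and averaging the L2 distance over samples and the PCC over the three VAD dimensions is an assumption, since the summary does not state how the values are aggregated.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(vx * vy)


def vad_metrics(pred, true):
    """Compare predicted and reference (valence, arousal, dominance) triples.

    Assumed aggregation (not specified in the source):
      - L2: Euclidean distance per sample, averaged over samples
      - MSE / MAE: averaged over every sample and every dimension
      - PCC: computed per dimension, then averaged over the 3 dimensions
    """
    n = len(pred)
    l2 = sum(math.dist(p, t) for p, t in zip(pred, true)) / n
    errors = [p[d] - t[d] for p, t in zip(pred, true) for d in range(3)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    pcc = sum(
        pearson([p[d] for p in pred], [t[d] for t in true]) for d in range(3)
    ) / 3
    return {"l2": l2, "mse": mse, "mae": mae, "pcc": pcc}
```

A call such as `vad_metrics(predicted_triples, reference_triples)` returns all four values in one dictionary, which keeps the continuous metrics computed on the same aligned samples.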
### Conclusion

The paper successfully achieves the conversion between discrete and continuous emotion labels through a multimodal framework and VAD scoring system, providing a more detailed and comprehensive emotion recognition method. Despite some limitations, such as the need for improved model performance and the small dataset size, this study offers valuable insights for future emotion recognition research.
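
The conversion from continuous VAD scores back to discrete emotion words can be sketched as a nearest-centroid lookup over K-means clusters of an emotion lexicon. The sketch below is illustrative only. The six-word `LEXICON` and its VAD values are made up for the example (the paper uses 195 emotion words), the function names are hypothetical, and the deterministic centroid initialization is a simplification rather than the authors' procedure.

```python
import math

# Hypothetical toy lexicon: word -> (valence, arousal, dominance) in [0, 1].
# These values are invented for illustration, not taken from the paper.
LEXICON = {
    "happy": (0.90, 0.70, 0.70),
    "excited": (0.85, 0.90, 0.70),
    "sad": (0.10, 0.30, 0.20),
    "gloomy": (0.15, 0.20, 0.25),
    "angry": (0.15, 0.90, 0.80),
    "furious": (0.10, 0.95, 0.85),
}


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm with a deterministic, evenly spaced init."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def build_word_clusters(lexicon, k):
    """Cluster the lexicon in VAD space and group words by cluster index."""
    words = list(lexicon)
    centroids = kmeans([lexicon[w] for w in words], k)
    groups = {i: [] for i in range(k)}
    for w in words:
        groups[nearest(lexicon[w], centroids)].append(w)
    return centroids, groups


def vad_to_words(vad, centroids, groups):
    """Map a predicted VAD triple to the emotion words of its nearest cluster."""
    return groups[nearest(vad, centroids)]
```

With clusters built once via `build_word_clusters(LEXICON, 3)`, any predicted VAD triple can be passed to `vad_to_words` to retrieve a set of candidate emotion words, which is one way a continuous prediction could yield open-vocabulary labels.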

Bridging Discrete and Continuous: A Multimodal Strategy for Complex Emotion Detection
