Bridging Discrete and Continuous: A Multimodal Strategy for Complex Emotion Detection

Jiehui Jia, Huan Zhang, Jinhua Liang
2024-09-12
Abstract: In the domain of human-computer interaction, accurately recognizing and interpreting human emotions is crucial yet challenging due to the complexity and subtlety of emotional expressions. This study explores the potential for detecting a rich and flexible range of emotions through a multimodal approach that integrates facial expressions, voice tones, and transcripts from video clips. We propose a novel framework that maps a variety of emotions into a three-dimensional Valence-Arousal-Dominance (VAD) space, which reflects the fluctuations and positivity/negativity of emotions and enables a more varied and comprehensive representation of emotional states. We employed K-means clustering to transition emotions from traditional discrete categorization to a continuous labeling system and built a classifier for emotion recognition on this system. The effectiveness of the proposed model is evaluated using the MER2024 dataset, which contains culturally consistent video clips from Chinese movies and TV series, annotated with both discrete and open-vocabulary emotion labels. Our experiment successfully achieved the transformation between discrete and continuous models, and the proposed model generated a more diverse and comprehensive emotion vocabulary while maintaining strong accuracy.
Multimedia
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper aims to address the complexity and subtlety in human emotion recognition. Specifically, it proposes a multimodal framework that spans both discrete and continuous emotion systems to detect a rich and flexible range of emotions. Traditional emotion recognition methods typically categorize emotions into a few basic categories, which have limitations when dealing with complex emotions. By integrating facial expressions, vocal intonations, and textual content from video clips, the paper introduces a new framework that maps multiple emotions into a three-dimensional Valence-Arousal-Dominance (VAD) space. This reflects the volatility and positivity/negativity of emotions, providing a more diverse and comprehensive representation of emotional states.

### Main Contributions

1. **Multimodal System**: The paper introduces a multimodal system that maps emotions into a continuous latent space, aligning with human perception. This framework outperforms existing classifiers on closed-set datasets.
2. **VAD Scoring System**: By combining the VAD scoring system with emotion classifiers, the paper learns more nuanced representations of emotional states than those learned from discrete categories.
3. **Open-Set Emotion Classification**: The paper benchmarks the open-set emotion classification task by applying the wav2vec model. Experimental results show a high correlation between the proposed model and ground truth.

### Method Overview

1. **K-means Clustering Classifier**: Used to convert between discrete and continuous emotion labels. By extracting 195 emotion words and their corresponding VAD scores, K-means clustering maps continuous VAD scores back to discrete emotion categories.
2. **ME2E Baseline Model**: Processes data from video, audio, and text modalities, extracting facial features, audio spectrograms, and textual data. These are analyzed through convolutional layers and Transformers, ultimately generating emotion predictions through a weighted fusion mechanism.
3. **ME2E Lite Improved Model**: Simplifies the original model architecture, reducing the number of parameters and making it more suitable for small-scale datasets.
4. **VAD Model**: Uses VAD scores as input, trained with a Mean Squared Error (MSE) loss function to independently predict the distribution of each VAD dimension. The predicted VAD scores are then passed to the K-means clustering classifier, achieving the conversion from continuous to discrete emotion labels.

### Experimental Setup and Evaluation

- **Dataset**: Uses the MER2024 dataset, which includes culturally consistent video clips from Chinese movies and TV dramas, annotated with six discrete emotions and open-vocabulary emotion labels.
- **Model Training**: Uses the SGD optimizer, adds batch normalization and dropout layers, and dynamically adjusts the learning rate. Trained for 30 epochs on a single NVIDIA Tesla V40 GPU.
- **Performance Evaluation**: Evaluates continuous emotion detection tasks using L2 distance, MSE, MAE, and Pearson Correlation Coefficient (PCC). Evaluates discrete emotion detection tasks using F1 score, precision, and recall.

### Results Analysis

- **Continuous Emotion Detection**: The proposed VAD model performs well in terms of L2 distance, MSE, and MAE, with a PCC value of 0.47, indicating effective emotion value prediction.
- **Discrete Emotion Detection**: Compared to the ME2E and ME2E Lite models, the proposed VAD model performs better in terms of precision and recall, with an F1 score of 0.42.
- **Open Vocabulary Exploration**: By generating open-vocabulary emotional responses, the model's effectiveness in capturing nuanced emotional states is validated.
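The evaluation metrics named above can be sketched in plain Python. This is a minimal illustration, not the authors' evaluation code: the function names `continuous_metrics`, `pearson`, and `macro_prf` are hypothetical, the PCC is averaged over the three VAD dimensions, and precision, recall, and F1 are macro-averaged over classes. The summary does not state which averaging the paper uses, so both choices are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # Correlation is undefined for constant input; return 0.0 by convention here.
    return cov / (sx * sy) if sx and sy else 0.0


def continuous_metrics(pred, true):
    """pred/true: equal-length lists of (valence, arousal, dominance) tuples."""
    # Mean Euclidean (L2) distance between predicted and true VAD points.
    l2 = sum(math.dist(p, t) for p, t in zip(pred, true)) / len(pred)
    # Per-coordinate errors, pooled over all samples and dimensions.
    errors = [pi - ti for p, t in zip(pred, true) for pi, ti in zip(p, t)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    # Assumption: PCC is computed per dimension, then averaged.
    pcc = sum(
        pearson([p[d] for p in pred], [t[d] for t in true]) for d in range(3)
    ) / 3
    return {"l2": l2, "mse": mse, "mae": mae, "pcc": pcc}


def macro_prf(pred_labels, true_labels):
    """Macro-averaged precision, recall, and F1 over all observed classes."""
    classes = sorted(set(pred_labels) | set(true_labels))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(pred_labels, true_labels))
        tp = sum(p == c and t == c for p, t in pairs)
        fp = sum(p == c and t != c for p, t in pairs)
        fn = sum(p != c and t == c for p, t in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Calling `continuous_metrics(pred, true)` on lists of VAD triples returns all four continuous scores at once, and `macro_prf(pred_labels, true_labels)` returns the three discrete scores as a tuple. Macro averaging weights every emotion class equally, which matters when some classes are rare.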
### Conclusion

The paper successfully achieves the conversion between discrete and continuous emotion labels through a multimodal framework and VAD scoring system, providing a more detailed and comprehensive emotion recognition method. Despite some limitations, such as the need for improved model performance and the small dataset size, this study offers valuable insights for future emotion recognition research.
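The conversion from continuous VAD scores back to emotion words can be illustrated with a small K-means sketch. It is a toy under stated assumptions, not the paper's pipeline: the seven-word `LEXICON` and its VAD values are made up for illustration (the paper uses 195 emotion words), `k=3` is arbitrary, the farthest-point initialization is a choice made here for determinism, and the helper names `kmeans` and `words_for` are hypothetical.

```python
import math

# Made-up VAD scores in [0, 1], for illustration only.
LEXICON = {
    "joyful": (0.95, 0.70, 0.75),
    "content": (0.85, 0.30, 0.65),
    "cheerful": (0.90, 0.65, 0.70),
    "furious": (0.10, 0.95, 0.70),
    "irritated": (0.20, 0.75, 0.55),
    "gloomy": (0.15, 0.25, 0.20),
    "lonely": (0.10, 0.30, 0.15),
}


def farthest_point_init(points, k):
    """Deterministic init: repeatedly pick the point farthest from chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=50):
    """Lloyd's algorithm; returns (centroids, cluster index of each point)."""
    centroids = farthest_point_init(points, k)
    assign = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assign = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing, so the centroids are stable
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign


WORDS = list(LEXICON)
CENTROIDS, ASSIGN = kmeans([LEXICON[w] for w in WORDS], k=3)


def words_for(vad):
    """Map a predicted (valence, arousal, dominance) point to its cluster's words."""
    nearest = min(range(len(CENTROIDS)), key=lambda j: math.dist(vad, CENTROIDS[j]))
    return [w for w, a in zip(WORDS, ASSIGN) if a == nearest]


print(words_for((0.9, 0.6, 0.7)))
```

A regressor's VAD prediction is passed to `words_for`, which returns every lexicon word in the nearest cluster. Returning the whole cluster rather than a single label is one way to obtain a wider emotion vocabulary from a continuous prediction.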