Dawn of the transformer era in speech emotion recognition: closing the valence gap

Johannes Wagner, Andreas Triantafyllopoulos, Hagen Wierstorf, Maximilian Schmitt, Felix Burkhardt, Florian Eyben, Björn W. Schuller
DOI: https://doi.org/10.1109/TPAMI.2023.3263585
2023-09-08
Abstract: Recent advances in transformer-based architectures which are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Furthermore, our investigations reveal that transformer-based architectures are more robust to small perturbations compared to a CNN-based baseline and fair with respect to biological sex groups, but not towards individual speakers. Finally, we are the first to show that their extraordinary success on valence is based on implicit linguistic information learnt during fine-tuning of the transformer layers, which explains why they perform on par with recent multimodal approaches that explicitly utilise textual information. Our findings collectively paint the following picture: transformer-based architectures constitute the new state-of-the-art in SER, but further advances are needed to mitigate remaining robustness and individual speaker issues. To make our findings reproducible, we release the best performing model to the community.
Audio and Speech Processing, Machine Learning, Sound
What problem does this paper attempt to address?
The paper examines how models based on the Transformer architecture perform in Speech Emotion Recognition (SER), focusing on the prediction of the emotion dimensions arousal, dominance, and valence. Specifically, it covers the following points:

1. **Problem**: Valence has long been predicted poorly from audio. The study tests whether fine-tuned pre-trained Transformer models (wav2vec 2.0 and HuBERT) close that gap. It also evaluates aspects that, according to the authors, earlier work did not cover or covered only briefly: the influence of model size and pre-training data, and the models' generalization, robustness, fairness, and efficiency.
2. **Methods and Experiments**: The authors fine-tuned several pre-trained variants of wav2vec 2.0 and HuBERT on the arousal, dominance, and valence annotations of MSP-Podcast, and used IEMOCAP and MOSI to test cross-corpus generalization. They also analyzed robustness to small perturbations of the input and fairness with respect to biological sex groups and individual speakers.
3. **Results**: The best model reaches a concordance correlation coefficient (CCC) of .638 for valence on MSP-Podcast. The authors describe this as, to the best of their knowledge, the top result obtained without explicit linguistic information, and as on par with recent multimodal methods that use text. They attribute the gain to implicit linguistic information learned while fine-tuning the Transformer layers. The models are more robust to small perturbations than a CNN-based baseline and fair with respect to biological sex groups, but not towards individual speakers.
4. **Contributions**: The paper shows that pre-trained Transformer models constitute the new state of the art in SER, particularly for the long-standing problem of valence recognition, while noting that robustness and individual-speaker issues remain. The authors release their best-performing model so that the community can reproduce the findings.
In summary, the paper addresses the poor valence prediction of audio-based speech emotion recognition by fine-tuning pre-trained Transformer models, and it evaluates these models for generalization, robustness, fairness, and efficiency.
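The valence result above is reported as a concordance correlation coefficient (CCC). As a minimal sketch of how that metric is commonly computed, the following uses the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population (biased) statistics. The function name `concordance_cc` and the sample values are illustrative assumptions; this is not the authors' evaluation code.

```python
def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient between two equal-length sequences.

    Illustrative helper (hypothetical name), using population statistics.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n

    # Population variance and covariance (divide by n, not n - 1).
    var_true = sum((t - mean_true) ** 2 for t in y_true) / n
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred) / n
    cov = sum((t - mean_true) * (p - mean_pred)
              for t, p in zip(y_true, y_pred)) / n

    denominator = var_true + var_pred + (mean_true - mean_pred) ** 2
    if denominator == 0:
        # Both sequences are constant and equal; CCC is undefined here.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denominator


if __name__ == "__main__":
    # Made-up valence labels and predictions, for demonstration only.
    labels = [0.1, 0.4, 0.7]
    shifted = [0.6, 0.9, 1.2]  # same shape, constant offset of 0.5
    print(concordance_cc(labels, labels))
    print(concordance_cc(labels, shifted))
```

Unlike the Pearson correlation, CCC penalizes differences in mean and scale: in the example, the shifted predictions correlate perfectly with the labels, yet their CCC falls well below 1 because of the constant offset.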