Abstract: Recent advances in transformer-based architectures which are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Furthermore, our investigations reveal that transformer-based architectures are more robust to small perturbations compared to a CNN-based baseline and fair with respect to biological sex groups, but not towards individual speakers. Finally, we are the first to show that their extraordinary success on valence is based on implicit linguistic information learnt during fine-tuning of the transformer layers, which explains why they perform on par with recent multimodal approaches that explicitly utilise textual information. Our findings collectively paint the following picture: transformer-based architectures constitute the new state-of-the-art in SER, but further advances are needed to mitigate remaining robustness and individual speaker issues. To make our findings reproducible, we release the best performing model to the community.
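The abstract reports its results as a concordance correlation coefficient (CCC). As a minimal sketch, the standard (Lin's) definition of CCC can be computed as below. The function name `concordance_cc` and the use of NumPy are assumptions made for illustration, and this is not necessarily the exact evaluation code the authors used.

```python
import numpy as np


def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient between two 1-D sequences.

    CCC = 2 * cov(t, p) / (var(t) + var(p) + (mean(t) - mean(p))**2)
    It equals 1 only for perfect agreement and, unlike Pearson correlation,
    is reduced by any shift or scale mismatch between the two sequences.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    mean_t, mean_p = y_true.mean(), y_pred.mean()
    # Population (biased) variance and covariance, as in Lin's definition.
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))

    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
```

For example, predictions that track the labels perfectly but are offset by a constant have a Pearson correlation of 1 and a CCC below 1, which is why CCC is a stricter measure of agreement for dimensional labels such as arousal, dominance, and valence.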

What problem does this paper attempt to address?

The paper primarily explores the application of Transformer-based models to Speech Emotion Recognition (SER), particularly the challenges of recognizing emotion dimensions. Specifically, the paper focuses on the following points:

1. **Problem Solving**: The study aims to improve the prediction of emotion dimensions in SER by using pre-trained Transformer models (such as wav2vec 2.0 and HuBERT), especially for the "valence" dimension, where performance has been poor. It also examines the models' generalization ability, robustness, fairness, and efficiency.
2. **Methods and Experiments**: The authors selected several pre-trained model variants and fine-tuned them on the arousal, dominance, and valence dimensions of MSP-Podcast. They additionally used IEMOCAP and MOSI to test cross-corpus generalization. The paper also analyzes the models' robustness (e.g., to small perturbations of the input signal) and fairness (e.g., performance across biological sex groups and individual speakers).
3. **Results**: Transformer-based models substantially improve valence prediction, reaching a CCC of .638 on MSP-Podcast and performing on par with multimodal methods without explicitly using linguistic information. The authors attribute this to implicit linguistic information learned while fine-tuning the transformer layers. The models are also more robust to small perturbations than a CNN-based baseline and fair with respect to biological sex groups, but not toward individual speakers.
4. **Contributions**: The main contribution of the paper is demonstrating how pre-trained Transformer models can enhance emotion recognition performance, particularly for the long-standing problem of valence recognition. Additionally, the authors have released their best-performing model so that the community can reproduce their findings.

In summary, the paper attempts to address the poor performance of valence prediction in speech emotion recognition using pre-trained Transformer models and provides an in-depth evaluation of these models in multiple aspects.
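To make the idea of a robustness check concrete, the sketch below adds white Gaussian noise to a waveform at a chosen signal-to-noise ratio. It is a hypothetical example of one small perturbation. The function name `add_noise_at_snr`, the choice of white noise, and the SNR value are all assumptions for illustration and are not taken from the paper's evaluation protocol.

```python
import numpy as np


def add_noise_at_snr(signal, snr_db, rng=None):
    """Return `signal` plus white Gaussian noise at roughly `snr_db` dB SNR.

    Hypothetical helper: the noise power is chosen so that
    10 * log10(signal_power / noise_power) == snr_db in expectation.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)

    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

    return signal + noise


# Usage: perturb one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 440.0 * t)
noisy = add_noise_at_snr(clean, snr_db=20.0, rng=np.random.default_rng(0))
```

A robustness analysis in this spirit would compare a model's predictions on `clean` and `noisy` inputs. A model whose outputs change little under such perturbations would be considered more robust.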

Dawn of the transformer era in speech emotion recognition: closing the valence gap

Emotion-Aware Transformer Encoder for Empathetic Dialogue Generation

Multi-Scale Temporal Transformer For Speech Emotion Recognition

A Residual Multi-Scale Convolutional Transformer Network with Chunk-level Log-Mel Spectrograms for Speech Emotion Recognition

MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers

Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition

A Pre-trained Audio-Visual Transformer for Emotion Recognition

An enhanced speech emotion recognition using vision transformer

Multilevel Transformer For Multimodal Emotion Recognition

Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers

Speech Emotion Recognition with Complementary Acoustic Representations

Cross-corpus speech emotion recognition with transformers: Leveraging handcrafted features and data augmentation

Hierarchical Transformer Network for Utterance-Level Emotion Recognition

Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture

A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations

DST: Deformable Speech Transformer for Emotion Recognition

Transformer Based Multimodal Speech Emotion Recognition with Improved Neural Networks

Decoding Emotions: A comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition

A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation

SS-Trans (Single-Stream Transformer for Multimodal Sentiment Analysis and Emotion Recognition): The Emotion Whisperer—A Single-Stream Transformer for Multimodal Sentiment Analysis

Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition