Abstract: This paper presents a paradigm that adapts general large-scale pretrained models (PTMs) to the speech emotion recognition task. Although PTMs shed new light on artificial general intelligence, they are constructed with general tasks in mind, and thus their efficacy for specific tasks can be further improved. Additionally, employing PTMs in practical applications can be challenging due to their considerable size. These limitations have spawned another research direction, namely, optimizing large-scale PTMs for specific tasks to generate task-specific PTMs that are both compact and effective. In this paper, we focus on the speech emotion recognition task and propose an improved emotion-specific pretrained encoder called Vesper. Vesper is pretrained on a speech dataset based on WavLM and takes into account emotional characteristics. To enhance sensitivity to emotional information, Vesper employs an emotion-guided masking strategy to identify the regions that need masking. Subsequently, Vesper employs hierarchical and cross-layer self-supervision to improve its ability to capture acoustic and semantic representations, both of which are crucial for emotion recognition. Experimental results on the IEMOCAP, MELD, and CREMA-D datasets demonstrate that Vesper with 4 layers outperforms WavLM Base with 12 layers, and the performance of Vesper with 12 layers surpasses that of WavLM Large with 24 layers.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is how to optimize large-scale pre-trained models (PTMs) for specific tasks, in particular the speech emotion recognition task. Although existing large-scale pre-trained models perform well on a variety of tasks, they are usually designed for general tasks, so there is still room to improve their performance on specific tasks. In addition, their size makes them difficult to deploy in practical applications. This research therefore aims to generate task-specific pre-trained models that are both compact and effective.

Specifically, the paper proposes an improved emotion-specific pre-trained encoder named Vesper, designed for the speech emotion recognition task. Vesper conducts further self-supervised pre-training based on WavLM and introduces the following innovations:

1. **Emotion-guided masking strategy**: To enhance sensitivity to emotional information, Vesper adopts a new emotion-guided masking strategy. By analyzing the energy (i.e., volume and intensity) of the input speech signal, it identifies the regions that may contain emotional information and applies masking only within these regions. This enables the model to focus better on emotional features.
2. **Hierarchical self-supervision**: To improve the ability to capture acoustic and semantic representations, Vesper adopts a hierarchical self-supervision method. The shallow layers are responsible for learning acoustic features, while the deep layers focus on semantic features. This hierarchical supervision helps the model capture the emotional information in speech more comprehensively.
3. **Cross-layer self-supervision**: To further enrich the final output representation, Vesper also introduces a cross-layer self-supervision mechanism. This method ensures that all levels of the model can effectively extract acoustic and semantic information, making the final representation more balanced and comprehensive.

Through these methods, Vesper not only outperforms existing models but also has a more compact model scale, which makes it more suitable for practical applications. Experimental results show that Vesper performs better than large pre-trained models such as WavLM on multiple speech emotion recognition datasets (IEMOCAP, MELD, and CREMA-D).

### Formula summary

- **RMS energy calculation formula**:
  \[ E(f)=\sqrt{\frac{1}{L}\sum_{l = 1}^{L}|A_f(l)|^2} \]
  where \(A\) is the input audio, \(A_f\) represents the audio segment of the \(f\)-th frame, \(L\) is the frame length, and \(E(f)\) is the RMS energy of the \(f\)-th frame.
- **Loss function**:
  - Shallow-layer loss:
    \[ L_l=\sum_{m\in I_p}\text{MSE}(P_1(\text{Tr}_V^{\frac{N}{2}}(x'_{\frac{N}{2}-1}))_m,\text{Tr}_W^{\frac{M}{2}}(y_{\frac{M}{2}-1})_m) \]
  - Deep-layer loss:
    \[ L_h=\sum_{m\in I_w}\text{MSE}(P_2(\text{Tr}_V^N(x''_{N - 1}))_m,\text{Tr}_W^M(y_{M - 1})_m) \]
  - Cross-layer loss:
    \[ L_x=\sum_m\text{MSE}(P_3(\text{Tr}_V^N(x''_{N - 1}))_m,\text{Tr}_W^{\frac{M}{2}}(y_{\frac{M}{2}-1})_m) \]
- **Total loss function**:
  \[ L=\lambda_lL_l+\lambda_hL_h+\lambda_xL_x \]
  where \(\lambda_l\), \(\lambda_h\), \(\lambda_x\) are hyper-parameters for balancing the different loss components.

Together, these components improve performance on the speech emotion recognition task while keeping the model efficient.
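As a rough illustration of the RMS energy formula and the weighted total loss, here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the `hop_length` parameter, the non-padded framing, and the default weights of 1.0 are all assumptions made for this example; the summary specifies only the frame length \(L\) and the weights \(\lambda_l\), \(\lambda_h\), \(\lambda_x\). The three loss values are taken as given scalars, since the projection heads and Transformer layers in the loss definitions are not reproduced here.

```python
import numpy as np


def frame_rms_energy(audio, frame_length, hop_length):
    """Per-frame RMS energy: E(f) = sqrt((1/L) * sum_l |A_f(l)|^2).

    Assumption for this sketch: frames start every `hop_length` samples
    and trailing samples that do not fill a whole frame are dropped.
    """
    audio = np.asarray(audio, dtype=np.float64)
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")
    if audio.size < frame_length:
        raise ValueError("audio is shorter than one frame")

    n_frames = 1 + (audio.size - frame_length) // hop_length
    energies = np.empty(n_frames)
    for f in range(n_frames):
        start = f * hop_length
        frame = audio[start:start + frame_length]  # A_f, length L
        energies[f] = np.sqrt(np.mean(np.abs(frame) ** 2))
    return energies


def total_loss(l_low, l_high, l_cross, lam_l=1.0, lam_h=1.0, lam_x=1.0):
    """Weighted sum L = lam_l * L_l + lam_h * L_h + lam_x * L_x.

    The default weights are placeholders, not values from the paper.
    """
    return lam_l * l_low + lam_h * l_high + lam_x * l_cross


if __name__ == "__main__":
    # Toy signal: one loud frame followed by one silent frame.
    signal = [3.0, -3.0, 3.0, -3.0, 0.0, 0.0, 0.0, 0.0]
    print(frame_rms_energy(signal, frame_length=4, hop_length=4))
    print(total_loss(1.0, 2.0, 3.0, lam_l=0.5, lam_h=1.0, lam_x=2.0))
```

In this toy signal the first frame has a much higher energy than the second, which is the kind of per-frame contrast an energy-based masking strategy can use to choose regions. How Vesper turns these energies into mask positions is not shown here.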

Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition

Deep Spectrum Feature Representations for Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

A Comparative Study of Pre-trained Speech and Audio Embeddings for Speech Emotion Recognition

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

A Pre-trained Audio-Visual Transformer for Emotion Recognition

A Residual Multi-Scale Convolutional Transformer Network with Chunk-level Log-Mel Spectrograms for Speech Emotion Recognition

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

BLSP-Emo: Towards Empathetic Large Speech-Language Models

Speech emotion recognition based on optimized deep features of dual-channel complementary spectrogram

Bimodal Speech Emotion Recognition Using Pre-Trained Language Models

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Dawn of the transformer era in speech emotion recognition: closing the valence gap

Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network

FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Emotion-Detecting Based Model Selection For Emotional Speech Recognition