Combining Multi-scale and Self-Supervised Features for Speech Emotion Recognition

Fei Lei, Zhuorui Wang, Yibo Ding
DOI: https://doi.org/10.23919/CCC58697.2023.10239882
2023-01-01
Abstract: Speech emotion recognition (SER) plays a crucial role in classifying the emotional information conveyed through audio signals, providing more accurate and convenient solutions for human-computer interaction, emotional analysis, and other fields. Recent research has focused on training Transformer-based models on human-annotated emotional datasets to capture long-range dependencies by modeling fixed-scale feature representations and processing time-varying spectral features as images. However, extracting efficient and robust common speech features from small-scale datasets is challenging, and dealing with scale variance is difficult due to the lack of inherent inductive bias (IB). To address these challenges, this paper proposes a novel architecture that extracts Multi-Scale features from raw signals and embeds them into Self-supervised Features (MSSF). Technically, this paper first designs a spatial pyramid reduction cell that combines rich multi-scale speech features by using multiple convolutions with different kernel sizes. Next, these features are embedded into a pre-trained self-supervised model to obtain multi-scale, discriminative, and common features for SER tasks. The predicted labels are then output by the final classification head. Additionally, this paper designs a parallel convolution block whose features are fused into the multi-scale features. Finally, MSSF is fine-tuned on the benchmark corpus IEMOCAP for four emotions. Compared to previous methods, the proposed model shows improvements on four common metrics, indicating its superiority.
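The multi-scale idea in the abstract (several convolutions with different kernel sizes applied to the same raw signal, with their outputs combined) can be illustrated with a minimal pure-Python sketch. This is not the authors' spatial pyramid reduction cell: the function names, the kernel sizes (3, 5, 7), the stride of 2, and the averaging kernels standing in for learned weights are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float], stride: int) -> List[float]:
    """Strided 1-D cross-correlation with zero padding (odd kernel length assumed)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    # One output per stride step; the stride is what reduces the sequence length.
    for start in range(0, len(signal), stride):
        window = padded[start:start + k]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out


def multi_scale_features(
    signal: Sequence[float],
    kernel_sizes: Sequence[int] = (3, 5, 7),  # hypothetical scales
    stride: int = 2,                          # hypothetical reduction factor
) -> List[List[float]]:
    """Return one feature vector per output frame, with one channel per kernel size."""
    branches = []
    for k in kernel_sizes:
        # Averaging kernel used as a placeholder for learned convolution weights.
        kernel = [1.0 / k] * k
        branches.append(conv1d(signal, kernel, stride))
    # All branches share the same stride and padding scheme, so they have equal
    # length and can be concatenated channel-wise, frame by frame.
    return [list(frame) for frame in zip(*branches)]


if __name__ == "__main__":
    raw = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
    for frame in multi_scale_features(raw):
        print(frame)
```

In a full model along the lines the abstract describes, frames like these would be passed on to a pre-trained self-supervised encoder and a classification head; those stages are omitted here because the abstract does not specify them.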