Are Paralinguistic Representations all that is needed for Speech Emotion Recognition?

Orchid Chetia Phukan, Gautam Siddharth Kashyap, Arun Balaji Buduru, Rajesh Sharma
2024-07-11
Abstract: The availability of representations from pre-trained models (PTMs) has facilitated substantial progress in speech emotion recognition (SER). In particular, representations from PTMs trained for paralinguistic speech processing have shown state-of-the-art (SOTA) performance for SER. However, such paralinguistic PTM representations have not been evaluated for SER in linguistic environments other than English. Nor have they been investigated for SER in benchmarks such as SUPERB, EMO-SUPERB, and ML-SUPERB. This makes it difficult to assess the efficacy of paralinguistic PTM representations for SER in multiple languages. To fill this gap, we perform a comprehensive comparative study of five SOTA PTM representations. Our results show that paralinguistic PTM (TRILLsson) representations perform the best, and this performance can be attributed to their capturing pitch, tone, and other speech characteristics more effectively than other PTM representations.
Audio and Speech Processing, Computation and Language, Sound
What problem does this paper attempt to address?
The paper primarily addresses the following issues:

1. **Evaluating and comparing the performance of pre-trained models (PTMs) in multilingual environments for speech emotion recognition (SER)**: Existing research indicates that models pre-trained for paralinguistic speech processing achieve excellent SER performance in English, but their performance in other languages has not been fully evaluated.
2. **Filling gaps in existing benchmarks**: Although some pre-trained models have achieved good results in benchmarks such as SUPERB and EMO-SUPERB, models pre-trained specifically for paralinguistic tasks have not been systematically evaluated on these benchmarks, especially for multilingual SER.
3. **Validating the effectiveness of paralinguistic pre-trained models**: Previous research by Shor et al. has shown that paralinguistic pre-trained models can achieve state-of-the-art SER performance in English, but the effectiveness of such models in other languages remains to be verified.

In summary, this paper evaluates and compares the SER capabilities of five state-of-the-art pre-trained models (including the paralinguistic pre-trained model TRILLsson) across multiple languages through a series of experiments. It also explores whether these models can effectively capture the speech features that influence emotion recognition in different languages, such as pitch and intonation. In doing so, it addresses the current research gap on the use of paralinguistic pre-trained models for multilingual SER.