Are Paralinguistic Representations all that is needed for Speech Emotion Recognition?

Orchid Chetia Phukan, Gautam Siddharth Kashyap, Arun Balaji Buduru, Rajesh Sharma

2024-07-11

Abstract: The availability of representations from pre-trained models (PTMs) has facilitated substantial progress in speech emotion recognition (SER). In particular, representations from PTMs trained for paralinguistic speech processing have shown state-of-the-art (SOTA) performance for SER. However, such paralinguistic PTM representations have not been evaluated for SER in linguistic environments other than English. Nor have they been investigated for SER in benchmarks such as SUPERB, EMO-SUPERB, and ML-SUPERB. This makes it difficult to assess the efficacy of paralinguistic PTM representations for SER in multiple languages. To fill this gap, we perform a comprehensive comparative study of five SOTA PTM representations. Our results show that paralinguistic PTM (TRILLsson) representations perform the best, and this performance can be attributed to their capturing pitch, tone, and other speech characteristics more effectively than other PTM representations.

Audio and Speech Processing, Computation and Language, Sound

What problem does this paper attempt to address?

The paper primarily addresses the following issues:

1. **Evaluating and comparing the performance of pre-trained models (PTMs) in multilingual environments for speech emotion recognition (SER)**: Existing research indicates that PTMs trained for paralinguistic speech processing achieve excellent SER performance in English, but their performance in other languages has not been evaluated.
2. **Filling gaps in existing benchmarks**: Although some PTMs have achieved good results in benchmarks such as SUPERB and EMO-SUPERB, PTMs built specifically for paralinguistic tasks have not been systematically evaluated on these benchmarks, especially for multilingual SER.
3. **Validating the effectiveness of paralinguistic PTMs**: Previous research by Shor et al. has shown that paralinguistic PTMs can achieve state-of-the-art SER performance in English, but the effectiveness of such models in other languages remains to be verified.

In summary, the paper evaluates and compares the SER capabilities of five state-of-the-art PTMs, including the paralinguistic PTM TRILLsson, across several languages through a series of experiments. It also explores whether these models effectively capture the speech features that influence emotion recognition, such as pitch and intonation, in different languages. In doing so, it addresses the current research gap regarding the use of paralinguistic PTMs in multilingual SER.
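The comparison described above can be illustrated with a toy protocol: every representation is scored with the same downstream classifier on every language, so that differences in accuracy reflect the representation alone. The sketch below is only an illustration under loud assumptions. The embeddings are synthetic Gaussian clusters, not outputs of any real PTM. The names `ptm_a`, `ptm_b`, `lang_1`, and `lang_2`, the emotion label set, and the nearest-centroid classifier are all hypothetical stand-ins and are not taken from the paper, which does not state its downstream model here.

```python
import random

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # illustrative label set (assumption)


def synthetic_embeddings(n_per_class, dim, separation, seed):
    """Return shuffled (vector, label) pairs, one Gaussian cluster per emotion.

    Stand-in for frozen PTM embeddings: `separation` controls how well the
    hypothetical representation separates the emotion classes.
    """
    rng = random.Random(seed)
    samples = []
    for label in range(len(EMOTIONS)):
        for _ in range(n_per_class):
            vec = [
                (separation if d % len(EMOTIONS) == label else 0.0) + rng.gauss(0.0, 1.0)
                for d in range(dim)
            ]
            samples.append((vec, label))
    rng.shuffle(samples)
    return samples


def fit_centroids(train):
    """Average the training vectors of each class."""
    sums, counts = {}, {}
    for vec, label in train:
        total = sums.setdefault(label, [0.0] * len(vec))
        for d, value in enumerate(vec):
            total[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def predict(centroids, vec):
    """Pick the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vec, centroids[label])),
    )


def accuracy(train, test):
    """Fit on `train`, then return the fraction of `test` samples classified correctly."""
    centroids = fit_centroids(train)
    correct = sum(predict(centroids, vec) == label for vec, label in test)
    return correct / len(test)


def compare(representations, languages, dim=8, n_per_class=50):
    """Score every (representation, language) pair with the same classifier."""
    scores = {}
    for r_idx, (name, separation) in enumerate(representations.items()):
        for l_idx, lang in enumerate(languages):
            seed = 1000 * r_idx + 10 * l_idx
            train = synthetic_embeddings(n_per_class, dim, separation, seed)
            test = synthetic_embeddings(n_per_class, dim, separation, seed + 1)
            scores[(name, lang)] = accuracy(train, test)
    return scores


# Hypothetical representations: ptm_a carries no class information, ptm_b a lot.
scores = compare({"ptm_a": 0.0, "ptm_b": 4.0}, ["lang_1", "lang_2"])
for key, value in sorted(scores.items()):
    print(key, round(value, 3))
```

Holding the classifier and the data split fixed across representations is the design choice that matters here. In a real study the synthetic generator would be replaced by embeddings extracted from each PTM for each language's corpus.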
