Abstract: Speech emotion recognition (SER) extracts emotional features from speech signals, analyzes their characteristic parameters, and judges the emotional state of the speech. SER is currently an important aspect of artificial psychology and artificial intelligence, with wide application in the human-computer interface, medical, and entertainment fields. In this work, six transforms, namely, the synchrosqueezing transform, fractional Stockwell transform (FST), K-sine transform-dependent integrated system (KSTDIS), flexible analytic wavelet transform (FAWT), chirplet transform, and superlet transform, are first applied to speech emotion signals. After the transforms are applied and the features are extracted, the essential features are selected using three techniques: Overlapping Information Feature Selection (OIFS) and two biomimetic intelligence-based optimization techniques, namely, Harris Hawks Optimization (HHO) and the Chameleon Swarm Algorithm (CSA). The selected features are then classified with ten basic machine learning classifiers, with special emphasis on the extreme learning machine (ELM) and twin extreme learning machine (TELM) classifiers. Experiments are conducted on four publicly available datasets, namely, EMOVO, RAVDESS, SAVEE, and Berlin Emo-DB. The best results are as follows: the Chirplet + CSA + TELM combination achieves a classification accuracy of 80.63% on the EMOVO dataset, the FAWT + HHO + TELM combination achieves 85.76% on the RAVDESS dataset, the Chirplet + OIFS + TELM combination achieves 83.94% on the SAVEE dataset, and the KSTDIS + CSA + TELM combination achieves 89.77% on the Berlin Emo-DB dataset.
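
To illustrate the final classification stage, the following is a minimal sketch of a generic extreme learning machine in NumPy, not the configuration used in this work. An ELM draws its hidden-layer weights at random, leaves them fixed, and solves only the output weights in closed form. The class name `ELM`, the tanh activation, the ridge term `reg`, the hidden-layer size, and the synthetic two-class data are all assumptions made for illustration; in the pipeline described above, the inputs would instead be the transform-domain features retained by OIFS, HHO, or CSA.

```python
import numpy as np


class ELM:
    """Generic single-hidden-layer extreme learning machine (illustrative)."""

    def __init__(self, n_hidden=32, reg=1e-3, seed=0):
        self.n_hidden = n_hidden  # number of random hidden units (assumed value)
        self.reg = reg            # ridge regularization strength (assumed value)
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Fixed random projection followed by a tanh nonlinearity.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Hidden weights and biases are sampled once and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # One-hot targets, one column per class.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Output weights from ridge-regularized least squares.
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        scores = self._hidden(X) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]


# Synthetic stand-in for selected features: two well-separated classes.
rng = np.random.default_rng(42)
X0 = rng.normal(loc=-1.0, scale=0.3, size=(150, 4))
X1 = rng.normal(loc=1.0, scale=0.3, size=(150, 4))
X = np.vstack([X0, X1])
y = np.array([0] * 150 + [1] * 150)

# Shuffle, then hold out the last 100 samples for evaluation.
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

elm = ELM(n_hidden=32).fit(X_train, y_train)
acc = float(np.mean(elm.predict(X_test) == y_test))
print(f"held-out accuracy: {acc:.3f}")
```

Because only `beta` is fitted, training reduces to a single linear solve, which is the usual reason ELM-style classifiers are inexpensive to evaluate across many transform and feature-selection combinations.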

EMO-SUPERB: An In-depth Look at Speech Emotion Recognition

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Self-attention Transfer Networks for Speech Emotion Recognition

Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark

Speech Emotion Recognition Based on Clustering Assistance

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Speech emotion recognition based on emotion perception

A Methodical Framework Utilizing Transforms and Biomimetic Intelligence-Based Optimization with Machine Learning for Speech Emotion Recognition

Improving Pre-trained Model-based Speech Emotion Recognition from a Low-level Speech Feature Perspective

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios

INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition

Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion

Speech Emotion Recognition with Multi-level Acoustic and Semantic Information Extraction and Interaction

Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting

Speech Emotion Recognition Using Convolutional Neural Networks with Attention Mechanism

Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

Emo-DNA: Emotion Decoupling and Alignment Learning for Cross-Corpus Speech Emotion Recognition