Abstract: Speech emotion recognition (SER) extracts emotional features from speech signals, analyzes their characteristic parameters, and infers the emotional state of the speech. SER is currently an important aspect of artificial psychology and artificial intelligence, with wide application in the human-computer interface, medical, and entertainment fields. In this work, six transforms are first applied to speech emotion signals: the synchrosqueezing transform, the fractional Stockwell transform (FST), the K-sine transform-dependent integrated system (KSTDIS), the flexible analytic wavelet transform (FAWT), the chirplet transform, and the superlet transform. After the transforms are applied and features are extracted, the essential features are selected using three techniques: the Overlapping Information Feature Selection (OIFS) technique, followed by two biomimetic intelligence-based optimization techniques, namely Harris Hawks Optimization (HHO) and the Chameleon Swarm Algorithm (CSA). The selected features are then classified with ten basic machine learning classifiers, with special emphasis on the extreme learning machine (ELM) and twin extreme learning machine (TELM) classifiers. Experiments are conducted on four publicly available datasets: EMOVO, RAVDESS, SAVEE, and Berlin Emo-DB. The best results are as follows: Chirplet + CSA + TELM reaches a classification accuracy of 80.63% on EMOVO, FAWT + HHO + TELM reaches 85.76% on RAVDESS, Chirplet + OIFS + TELM reaches 83.94% on SAVEE, and KSTDIS + CSA + TELM reaches 89.77% on Berlin Emo-DB.
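
To make the final classification stage more concrete, the following is a minimal sketch of a basic extreme learning machine of the kind the abstract refers to: a single hidden layer whose weights are drawn at random and left fixed, with only the output weights fitted by regularized least squares. It is not the authors' implementation and does not cover the twin variant (TELM). The class name `TinyELM`, the tanh activation, the ridge term `reg`, and the random toy features are all assumptions made for illustration; in the pipeline described above, the inputs would be the features that remain after transform-based extraction and feature selection.

```python
import numpy as np


class TinyELM:
    """Illustrative extreme learning machine (hypothetical helper, not the paper's code)."""

    def __init__(self, n_hidden=50, reg=1e-3, seed=0):
        self.n_hidden = n_hidden  # number of random hidden units (assumed value)
        self.reg = reg            # ridge regularization strength (assumed value)
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Fixed random projection followed by a tanh nonlinearity.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One-hot targets, one column per emotion class.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Hidden-layer weights are random and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights: closed-form ridge least-squares solution.
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        H = self._hidden(np.asarray(X, dtype=float))
        # Pick the class with the largest output score.
        return self.classes_[np.argmax(H @ self.beta, axis=1)]


# Toy usage with random stand-in features (not real speech data).
toy_rng = np.random.default_rng(42)
toy_X = np.vstack([toy_rng.normal(-3, 1, (20, 4)), toy_rng.normal(3, 1, (20, 4))])
toy_y = np.array(["calm"] * 20 + ["angry"] * 20)
toy_model = TinyELM(n_hidden=30).fit(toy_X, toy_y)
toy_pred = toy_model.predict(toy_X)
```

Because only the output weights are solved for, training reduces to one linear solve, which is the usual reason ELM-style classifiers are fast to fit on moderately sized feature sets.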

Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture

Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment

Multi-Scale Temporal Transformer For Speech Emotion Recognition

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Dawn of the transformer era in speech emotion recognition: closing the valence gap

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

A Residual Multi-Scale Convolutional Transformer Network with Chunk-level Log-Mel Spectrograms for Speech Emotion Recognition

Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention

MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers

Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer

A Multi-Level Circulant Cross-Modal Transformer for Multimodal Speech Emotion Recognition

Transformer Based Multimodal Speech Emotion Recognition with Improved Neural Networks

A Hybrid Time-Distributed Deep Neural Architecture for Speech Emotion Recognition

A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation

Multilevel Transformer For Multimodal Emotion Recognition

Bi-Modal Bi-Task Emotion Recognition Based on Transformer Architecture

A Methodical Framework Utilizing Transforms and Biomimetic Intelligence-Based Optimization with Machine Learning for Speech Emotion Recognition

Hierarchical Transformer Network for Utterance-Level Emotion Recognition

Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism

Cross-Language Speech Emotion Recognition Using Multimodal Dual Attention Transformers

Emo-TTS: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness