Abstract: Speech emotion recognition (SER) is a key branch of artificial intelligence that focuses on analyzing and understanding the emotional content of human speech. It draws on multidisciplinary knowledge from acoustics, phonetics, linguistics, pattern recognition, and neurobiology, and aims to establish a connection between human speech and emotional expression. The technology has broad application prospects in medicine, education, and customer service. With the evolution of deep learning and neural networks, SER research has shifted from manually designed low-level descriptors (LLDs) to complex neural network models that extract high-dimensional features. A perennial challenge has been how to capture rich emotional features comprehensively. Because emotional information is present in both the time and frequency domains, our study introduces a novel time–frequency domain convolution module (TFCM) based on Mel-frequency cepstral coefficient (MFCC) features to mine the time–frequency information of MFCCs in depth. In the deep feature extraction phase, we introduce hybrid dilated convolution (HDC) into the SER field for the first time, which significantly expands the receptive field of neurons and thereby enhances feature richness and diversity. Furthermore, we propose the residual attention-gated multilayer perceptron (RA-GMLP) structure, which combines the global feature recognition ability of the GMLP with the concentrated weighting function of the multihead attention mechanism to focus on the key emotional information within the speech sequence. Extensive experiments demonstrate that TFCM, HDC, and RA-GMLP surpass existing advanced techniques in improving the accuracy of SER tasks, confirming the advantages of the proposed modules.
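
The claim that HDC expands the receptive field can be illustrated with a little arithmetic. The sketch below is not the authors' implementation: the kernel size of 3 and the dilation rates (1, 2, 5) and (2, 2, 2) are assumptions chosen only for illustration, and the helper names are hypothetical. It computes the receptive field of stacked stride-1 dilated 1-D convolutions and checks whether every input position inside that field actually contributes to the output.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in input samples, of stacked stride-1 dilated 1-D convolutions."""
    rf = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation samples.
        rf += (kernel_size - 1) * d
    return rf


def coverage(kernel_size, dilations):
    """Input offsets (relative to one output position) that reach the output.

    Assumes an odd kernel_size so the kernel is centred on the output position.
    """
    half = kernel_size // 2
    offsets = {0}
    for d in dilations:
        offsets = {o + k * d for o in offsets for k in range(-half, half + 1)}
    return offsets


def has_gaps(kernel_size, dilations):
    """True if some positions inside the receptive field are never sampled."""
    offs = coverage(kernel_size, dilations)
    return len(offs) < max(offs) - min(offs) + 1


if __name__ == "__main__":
    # Illustrative rates only; the source does not state the rates it uses.
    for rates in [(1, 1, 1), (2, 2, 2), (1, 2, 5)]:
        print(rates, receptive_field(3, rates), has_gaps(3, rates))
```

Under these assumptions, three ordinary convolutions see 7 samples, while the mixed rates (1, 2, 5) see 17 with no unsampled positions. The uniform rates (2, 2, 2) see 13 samples but skip every odd offset, which is the gridding effect that mixing dilation rates is meant to avoid.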

Speech Emotion Recognition Using Mel-Frequency Cepstral Coefficients & Convolutional Neural Networks

Self-attention Transfer Networks for Speech Emotion Recognition

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Cross-Corpus Speech Emotion Recognition Based on Hybrid Neural Networks

Speech Emotion Recognition Using Mel Frequency Log Spectrogram and Deep Convolutional Neural Network

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Convolutional neural network-based cross-corpus speech emotion recognition with data augmentation and features fusion

Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence.

Effective MLP and CNN based ensemble learning for speech emotion recognition

A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition

Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features

Multichannel CNN-BLSTM Architecture for Speech Emotion Recognition System by Fusion of Magnitude and Phase Spectral Features Using DCCA for Consumer Applications

Speech emotion recognition based on optimized deep features of dual-channel complementary spectrogram

A Residual Multi-Scale Convolutional Transformer Network with Chunk-level Log-Mel Spectrograms for Speech Emotion Recognition

Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism

Leveraged Mel spectrograms using Harmonic and Percussive Components in Speech Emotion Recognition

Speech Emotion Recognition Using RA-Gmlp Model on Time–Frequency Domain Features Extracted by TFCM