An enhanced speech emotion recognition using vision transformer

Samson Akinpelu, Serestina Viriri, Adekanmi Adegun

DOI: https://doi.org/10.1038/s41598-024-63776-4

IF: 4.6

2024-06-09

Scientific Reports

Abstract: In human–computer interaction systems, speech emotion recognition (SER) plays a crucial role because it enables computers to understand and react to users' emotions. In the past, SER has relied heavily on acoustic properties extracted from speech signals. Recent developments in deep learning and computer vision, however, have made it possible to use visual signals to enhance SER performance. This work proposes a novel method for improving speech emotion recognition based on a lightweight Vision Transformer (ViT) model. We leverage the ViT model's ability to capture spatial dependencies and high-level features in images, which are adequate indicators of emotional states, from the mel spectrogram input fed into the model. To determine the efficiency of our proposed approach, we conduct a comprehensive experiment on two benchmark speech emotion datasets, the Toronto Emotional Speech Set (TESS) and the Berlin Emotional Database (EMODB). The results demonstrate a considerable improvement in speech emotion recognition accuracy and attest to the model's generalizability: it achieved 98%, 91%, and 93% accuracy on TESS, EMODB, and the combined TESS-EMODB set, respectively. The comparative experiment against other state-of-the-art techniques shows that the non-overlapping patch-based feature extraction method substantially improves speech emotion recognition. Our research indicates the potential for integrating vision transformer models into SER systems, opening up fresh opportunities for real-world applications that require accurate emotion recognition from speech.
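The non-overlapping patch step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `split_into_patches`, the 8 x 12 stand-in spectrogram, and the patch size of 4 are all assumptions chosen for illustration, and a real pipeline would first compute the mel spectrogram from audio.

```python
import numpy as np


def split_into_patches(spec: np.ndarray, patch: int) -> np.ndarray:
    """Split a 2-D spectrogram into non-overlapping, flattened square patches.

    Returns an array of shape (num_patches, patch * patch), ordered row by row.
    """
    height, width = spec.shape
    if height % patch or width % patch:
        raise ValueError("spectrogram sides must be multiples of the patch size")
    rows, cols = height // patch, width // patch
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch), then flatten each patch
    grid = spec.reshape(rows, patch, cols, patch).transpose(0, 2, 1, 3)
    return grid.reshape(rows * cols, patch * patch)


# Stand-in for a mel spectrogram: 8 mel bands x 12 time frames (hypothetical sizes).
mel = np.arange(8 * 12, dtype=np.float32).reshape(8, 12)
patches = split_into_patches(mel, patch=4)
print(patches.shape)  # (6, 16)
```

Because the patches do not overlap, every spectrogram value lands in exactly one patch, so the sequence length grows only with the image area divided by the patch area.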

multidisciplinary sciences

What problem does this paper attempt to address?

The paper aims to address two core issues in Speech Emotion Recognition (SER):

1. **Improving emotion recognition accuracy**: Enhancing the accuracy of emotion recognition from speech signals by improving existing techniques.
2. **Reducing computational complexity**: Decreasing the computational resources required to achieve efficient emotion recognition.

To achieve these goals, the researchers propose a novel approach that uses a lightweight Vision Transformer (ViT) model for emotion recognition. The method extracts Mel spectrograms from speech and feeds them into the ViT model, whose self-attention mechanism is used to recognize emotions. Compared to traditional Convolutional Neural Networks (CNN), ViT can directly learn global features from input images and capture spatial dependencies, which helps the model understand the emotion-rich characteristics of speech signals.

The researchers conducted experimental evaluations on two benchmark datasets: the Toronto Emotional Speech Set (TESS) and the Berlin Emotional Database (EMODB). The proposed model achieved accuracies of 98%, 91%, and 93% on TESS, EMODB, and the combined TESS-EMODB set, respectively, improving emotion recognition performance and indicating its applicability to real-world use. Additionally, the study points out that the non-overlapping patch-based feature extraction method can substantially enhance the accuracy of speech emotion recognition, opening up new possibilities for integrating Vision Transformer models into SER systems.
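To make the point about global features concrete, here is a minimal sketch of single-head scaled dot-product self-attention over a sequence of patch embeddings. It is a generic illustration, not the paper's architecture: the function name `self_attention`, the random weights, and the sizes (6 patches, 16 dimensions) are assumptions, and a full ViT would add multiple heads, positional embeddings, and feed-forward layers.

```python
import numpy as np


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all patches
    return weights @ v, weights


rng = np.random.default_rng(0)
num_patches, dim = 6, 16  # hypothetical sizes
tokens = rng.normal(size=(num_patches, dim))  # stand-in patch embeddings
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` is a distribution over all patches, so every output token can draw on any region of the spectrogram in a single layer, whereas a convolution only sees a local neighbourhood per layer.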

Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers

Self-attention Transfer Networks for Speech Emotion Recognition

Attention on Emotions: A Vision Transformer Approach to Advancing Facial Expression Recognition

Emotion-Aware Transformer Encoder for Empathetic Dialogue Generation

MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers

Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer

Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer

Emotion Recognition Using Transformers with Masked Learning

Dawn of the transformer era in speech emotion recognition: closing the valence gap

Cross-corpus speech emotion recognition with transformers: Leveraging handcrafted features and data augmentation

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speech Emotion Recognition Using Mel-Frequency Cepstral Coefficients & Convolutional Neural Networks

Facial Expression Recognition Based on Multi-Scale Convolutional Vision Transformer

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

A Methodical Framework Utilizing Transforms and Biomimetic Intelligence-Based Optimization with Machine Learning for Speech Emotion Recognition

Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism

Transformer Based Multimodal Speech Emotion Recognition with Improved Neural Networks

Speech emotion recognition based on optimized deep features of dual-channel complementary spectrogram

Transforming the Embeddings: A Lightweight Technique for Speech Emotion Recognition Tasks

MViT: Mask Vision Transformer for Facial Expression Recognition in the Wild