A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition

Mustaqeem, Soonil Kwon

DOI: https://doi.org/10.3390/s20010183

IF: 3.9

2019-12-28

Sensors

Abstract: Speech is the most significant mode of communication among human beings and, captured with a microphone sensor, a potential method for human-computer interaction (HCI). Quantifiable emotion recognition from speech signals using these sensors is an emerging area of research in HCI, with applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers, where the speaker’s emotional state is determined from an individual’s speech. In this paper, we present two major contributions: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps to perform better. Local hidden patterns are learned in convolutional layers that use special strides, rather than pooling layers, to down-sample the feature maps, and global discriminative features are learned in fully connected layers. A SoftMax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, improving accuracy by 7.85% and 4.5%, respectively, with the model size reduced by 34.5 MB. This demonstrates the effectiveness and significance of the proposed SER technique and reveals its applicability in real-world applications.
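The idea of down-sampling inside the convolution itself, rather than in a separate pooling layer, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the 3×3 filter and stride of 2 follow the summary of the paper, while the zero padding of 1, the toy 8×8 input, the averaging filter, and the function names `conv_output_size` and `strided_conv2d` are assumptions made for illustration.

```python
def conv_output_size(n, kernel=3, stride=2, padding=1):
    """Spatial size along one axis after a convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def strided_conv2d(image, kernel, stride=2, padding=1):
    """Naive single-channel 2D convolution (cross-correlation) with zero padding."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    # Zero-pad the input on all four sides (padding value is an assumption).
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out_h = conv_output_size(h, k, stride, padding)
    out_w = conv_output_size(w, k, stride, padding)
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # The window origin moves by `stride`, so the output is down-sampled.
            for a in range(k):
                for b in range(k):
                    acc += kernel[a][b] * padded[i * stride + a][j * stride + b]
            row.append(acc)
        out.append(row)
    return out


# Toy 8x8 "spectrogram" and a 3x3 averaging filter (both hypothetical).
spectrogram = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
mean_filter = [[1.0 / 9.0] * 3 for _ in range(3)]

feature_map = strided_conv2d(spectrogram, mean_filter, stride=2, padding=1)
print(len(feature_map), len(feature_map[0]))  # 4 4
```

With these settings each strided layer halves both spatial dimensions, which is the role a 2×2 pooling layer would otherwise play, but without adding a separate layer.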

engineering, electrical & electronic; chemistry, analytical; instruments & instrumentation

What problem does this paper attempt to address?

This paper aims to improve the accuracy of Speech Emotion Recognition (SER) while reducing the computational complexity of SER models. Specifically, the authors propose an architecture based on the Deep Stride Convolutional Neural Network (DSCNN) that learns salient and discriminative features from spectrograms of speech signals. By using a special stride to down-sample the feature map directly in the convolution layer, instead of a traditional pooling layer, the method reduces the size and computational cost of the model while maintaining high accuracy. The main contributions of the paper include:

1. **Pre-processing**: A new adaptive threshold technique is proposed to remove background noise and silent parts, improving data quality and providing cleaner input for subsequent emotion recognition.

2. **CNN model**: A new DSCNN architecture is designed. It uses small filters (3×3) and a specific stride (2×2) to down-sample directly in the convolution layer, thus effectively extracting high-level features.

3. **Computational complexity**: By reducing the number of convolution layers and using smaller convolution kernels, the overall computational complexity of the model is reduced while the accuracy of emotion recognition is improved.

Experimental results show that the method improves accuracy by 7.85% and 4.5% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, respectively, and reduces the model size by 34.5 MB. This demonstrates the effectiveness and practical value of the proposed SER technique.
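The pre-processing step can be pictured with a rough sketch of energy-based silence removal. The paper's actual adaptive threshold rule is not described in this summary, so the rule used here (a fixed fraction of the utterance's mean frame energy), the frame length, and the name `remove_silence` are all assumptions chosen only to show the general shape of such a step.

```python
import math


def remove_silence(samples, frame_len=4, ratio=0.5):
    """Drop frames whose energy falls below an utterance-dependent threshold.

    Assumption: the threshold is `ratio` times the mean frame energy; this
    stands in for the paper's adaptive rule, which is not given here.
    """
    if not samples:
        return []
    # Split the signal into consecutive, non-overlapping frames.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    # Mean squared amplitude per frame.
    energies = [sum(x * x for x in f) / len(f) for f in frames]
    # The threshold adapts to the overall energy of this utterance.
    threshold = ratio * sum(energies) / len(energies)
    kept = []
    for frame, energy in zip(frames, energies):
        if energy >= threshold:
            kept.extend(frame)
    return kept


# Hypothetical signal: silence, a short sine burst, then silence again.
burst = [math.sin(2 * math.pi * n / 8) for n in range(8)]
signal = [0.0] * 8 + burst + [0.0] * 8

cleaned = remove_silence(signal)
print(len(signal), len(cleaned))  # 24 8
```

Because the threshold is derived from each recording rather than fixed globally, quiet and loud recordings are trimmed on a comparable basis; a real system would typically use longer, overlapping frames.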
