Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks

Yuxiang Kuang, Qun Wu, Ying Wang, Nilanjan Dey, Fuqian Shi, Rubén González Crespo, R. Simon Sherratt

DOI: https://doi.org/10.1016/j.asoc.2020.106775

IF: 8.7

2020-12-01

Applied Soft Computing

Abstract:<p>Facial expressions, verbal cues, behaviors such as limb movements, and physiological features are vital channels for affective human interaction. Over the past decades, researchers have given machines the ability to recognize affective communication through these modalities. In addition to facial expressions, changes in sound level, strength, weakness, and turbulence also convey affect. Extracting affective feature parameters from acoustic signals has been widely applied in customer service, education, and the medical field. In this research, an improved AlexNet-based deep convolutional neural network (A-DCNN) is presented for acoustic signal recognition. First, the signals were preprocessed using simplified inverse filter tracking (SIFT) and the short-time Fourier transform (STFT); Mel-frequency cepstral coefficients (MFCC) and waveform-based segmentation, which are widely used for signal preprocessing in most neural networks, were then deployed to create the input for the deep neural network (DNN). Second, acoustic signals were acquired from the affective speech audio of the public Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Using the acoustic signal preprocessing tools, the basic features of these sound signals were calculated and extracted. The proposed DNN based on the improved AlexNet achieves 95.88% accuracy in classifying eight affective categories of acoustic signals. Compared with some linear classifiers, such as decision table (DT) and Bayesian inference (BI), and with other deep neural networks, such as AlexNet+SVM and the recurrent convolutional neural network (R-CNN), among others, the proposed method achieves high effectiveness in terms of accuracy (A), sensitivity (S1), positive predictive value (PP), and F1-score (F1). Affective recognition and classification of acoustic signals can potentially be applied in industrial product design by measuring consumers' affective responses to products: relevant affective sound data can be collected to understand the popularity of a product and, furthermore, to improve the product design and increase market responsiveness.</p>
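The STFT step named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the frame length, hop size, FFT size, Hann window, and the function name `stft_log_magnitude` are all assumptions chosen for illustration, and the SIFT and MFCC stages are not reproduced here.

```python
import numpy as np

def stft_log_magnitude(signal, frame_len=400, hop=160, n_fft=512):
    """Return a (n_frames, n_fft // 2 + 1) log-magnitude spectrogram.

    All parameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    # Pad very short inputs so that at least one full frame exists.
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    # Build an index matrix that slices the signal into overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * window
    # Real FFT of each windowed frame (zero-padded to n_fft samples).
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    # A small constant avoids log(0) in silent frames.
    return np.log(np.abs(spectrum) + 1e-10)

# Usage: a 1 kHz test tone sampled at 16 kHz (synthetic data, not RAVDESS).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
spec = stft_log_magnitude(tone)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sr / 512
```

A time-frequency array of this kind is one plausible form for an image-like CNN input; how the paper actually combines the SIFT, STFT, and MFCC outputs is not specified in the abstract.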

computer science, artificial intelligence, interdisciplinary applications

What problem does this paper attempt to address?

The paper primarily addresses the problem of emotional speech signal recognition, specifically:

1. **Research Background and Objectives**: The paper points out that affective computing, which can detect, classify, organize, and respond to human emotional communication, is crucial for achieving friendlier and more efficient human-computer interaction. Although sound signals themselves do not have emotions, characteristics such as intensity and pitch can convey emotional information. Therefore, extracting emotional feature parameters from audio signals and applying them in fields such as customer service, education, and healthcare is of great significance.

2. **Method**: To improve the accuracy of emotional speech signal recognition, the paper proposes a deep convolutional neural network (A-DCNN) based on an improved version of AlexNet. First, the audio signals are preprocessed with Simplified Inverse Filter Tracking (SIFT) and the Short-Time Fourier Transform (STFT), together with Mel-Frequency Cepstral Coefficients (MFCC) and waveform-based segmentation, to create inputs for the deep neural network. Then, emotional speech signals are obtained from the public dataset RAVDESS, and the basic features of these signals are calculated with the preprocessing tools. The proposed network achieves an accuracy of 95.88% when classifying eight types of emotional speech signals.

3. **Innovations**: Compared with traditional linear classification methods (such as Decision Table, DT, and Bayesian Inference, BI) and other deep neural networks (such as AlexNet+SVM and the Recurrent Convolutional Neural Network, R-CNN), the proposed method shows higher effectiveness in terms of accuracy, sensitivity, positive predictive value, and F1 score.

4. **Application Scenarios**: Emotional speech signal recognition and classification can be applied in industrial product design to understand the popularity of products by measuring consumers' emotional responses to them, thereby further improving product design and increasing market response speed.

In summary, the goal of this paper is to improve the accuracy of emotional speech signal recognition by proposing an improved deep learning framework, thereby providing technical support for applications in related fields.
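The four evaluation metrics named in the summary (accuracy, sensitivity, positive predictive value, and F1 score) can be derived from a confusion matrix. The sketch below is a generic one-vs-rest computation, not the authors' evaluation code: the function name `per_class_metrics` and the toy labels are hypothetical, and the source does not state how the per-class values were averaged.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Compute overall accuracy and per-class sensitivity, PPV, and F1."""
    # Confusion matrix: rows are true classes, columns are predictions.
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    tp = np.diag(cm).astype(np.float64)
    fn = cm.sum(axis=1) - tp  # samples of the class that were missed
    fp = cm.sum(axis=0) - tp  # samples wrongly assigned to the class
    # Guard against empty classes; undefined ratios are reported as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        sensitivity = np.where(tp + fn > 0, tp / (tp + fn), 0.0)
        ppv = np.where(tp + fp > 0, tp / (tp + fp), 0.0)
        f1 = np.where(sensitivity + ppv > 0,
                      2 * sensitivity * ppv / (sensitivity + ppv), 0.0)
    accuracy = tp.sum() / cm.sum()
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "ppv": ppv, "f1": f1}

# Usage with made-up labels for three classes (the paper uses eight).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = per_class_metrics(y_true, y_pred, n_classes=3)
```

Reporting per-class values leaves the choice of macro or weighted averaging to the reader, which seems the safer default when the source does not say which one was used.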

Hybrid Network Feature Extraction for Depression Assessment from Speech

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching

Attention-Based Acoustic Feature Fusion Network for Depression Detection

Automated depression analysis using convolutional neural networks from speech

Research on Chinese Speech Emotion Recognition Based on Deep Neural Network and Acoustic Features

Towards Robust Deep Neural Networks for Affect and Depression Recognition from Speech

A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition

Improving Speech Recognition with Convolutional Neural Networks

Automatic Facial Expression Recognition Based on a Deep Convolutional-Neural-network Structure

Adaptive DCTNet for Audio Signal Classification

The Construction of a Neural Network Model for Speech Emotion Recognition

Automated Affective Computing Based on Bio-Signals Analysis and Deep Learning Approach

Detection of Emotion of Speech for RAVDESS Audio Using Hybrid Convolution Neural Network

MFCC-based Recurrent Neural Network for automatic clinical depression recognition and assessment from speech

Emotion Recognition in Audio and Video Using Deep Neural Networks

Speech emotion recognition with deep convolutional neural networks

Efficient Feature-Aware Hybrid Model of Deep Learning Architectures for Speech Emotion Recognition

Characterizing Types of Convolution in Deep Convolutional Recurrent Neural Networks for Robust Speech Emotion Recognition

Audio-video Emotion Recognition in the Wild using Deep Hybrid Networks