Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks

Yuxiang Kuang, Qun Wu, Ying Wang, Nilanjan Dey, Fuqian Shi, Rubén González Crespo, R. Simon Sherratt
DOI: https://doi.org/10.1016/j.asoc.2020.106775
IF: 8.7
2020-12-01
Applied Soft Computing
Abstract:<p>Facial expressions, verbal and behavioral cues such as limb movements, and physiological features are vital channels for affective human interaction. Over the past decades, researchers have given machines the ability to recognize affective communication through these modalities. In addition to facial expressions, changes in the level, strength, weakness, and turbulence of sound also convey affect. Extracting affective feature parameters from acoustic signals has been widely applied in customer service, education, and the medical field. In this research, an improved AlexNet-based deep convolutional neural network (A-DCNN) is presented for acoustic signal recognition. First, the signals were preprocessed using simplified inverse filter tracking (SIFT) and the short-time Fourier transform (STFT); Mel-frequency cepstral coefficients (MFCC) and waveform-based segmentation, which are widely used in signal preprocessing for neural networks, were then deployed to create the input for the deep neural network (DNN). Second, acoustic signals were acquired from the affective speech audio of the public Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Using the acoustic signal preprocessing tools, the basic features of these sound signals were calculated and extracted. The proposed DNN based on the improved AlexNet achieves 95.88% accuracy in classifying eight affective categories of acoustic signals. Compared with some linear classifiers, such as decision table (DT) and Bayesian inference (BI), and with other deep neural networks, such as AlexNet+SVM and the recurrent convolutional neural network (R-CNN), the proposed method achieves high effectiveness in accuracy (A), sensitivity (S1), positive predictive value (PP), and F1-score (F1).
Affective recognition and classification of acoustic signals can potentially be applied in industrial product design by measuring consumers' affective responses to products: relevant affective sound data can be collected to gauge a product's popularity and, in turn, to improve the product design and increase market responsiveness.</p>
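As an illustration of the STFT step named in the abstract, the following is a minimal sketch of a magnitude spectrogram computed with NumPy alone. It is not the authors' pipeline: the function name `stft_magnitude`, the frame length, the hop size, the Hann window, and the 16 kHz test tone are all assumptions chosen for the example, and the SIFT pitch tracking and MFCC stages are omitted.

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Magnitude spectrogram of a 1-D signal, shape (frames, frequency bins).

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    window = np.hanning(frame_len)
    # Number of full frames that fit in the signal.
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice overlapping frames and taper each one with the window.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Real-input FFT along the time axis of each frame.
    return np.abs(np.fft.rfft(frames, axis=1))

# Hypothetical test signal: one second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

spec = stft_magnitude(tone)
peak_bin = int(spec.mean(axis=0).argmax())   # strongest frequency bin on average
peak_hz = peak_bin * sr / 512                # convert bin index to Hz
```

In a setup like the one described, a spectrogram of this kind (or features derived from it) would be resized or segmented into fixed-size inputs for the network; how the paper does this is not stated in the abstract.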
computer science, artificial intelligence, interdisciplinary applications
What problem does this paper attempt to address?
The paper primarily addresses the recognition of emotional speech signals. Specifically:

1. **Research Background and Objectives**: The paper points out that affective computing, which can detect, classify, organize, and respond to human emotional communication, is crucial for achieving friendlier and more efficient human-computer interaction. Although sound signals themselves do not have emotions, characteristics such as intensity and pitch can convey emotional information. Extracting emotional feature parameters from audio signals and applying them in fields such as customer service, education, and healthcare is therefore of great significance.
2. **Method**: To improve the accuracy of emotional speech signal recognition, the paper proposes a deep convolutional neural network (A-DCNN) based on an improved version of AlexNet. First, the audio signals are preprocessed using Simplified Inverse Filter Tracking (SIFT) and the Short-Time Fourier Transform (STFT), together with Mel-Frequency Cepstral Coefficients (MFCC) and waveform-based segmentation, to create inputs for the deep neural network. Then, emotional speech signals are obtained from the public RAVDESS dataset, and the basic features of these signals are calculated with the preprocessing tools. The proposed network achieves an accuracy of 95.88% on the task of classifying eight types of emotional speech signals.
3. **Innovations**: Compared with traditional linear classification methods (such as Decision Table, DT, and Bayesian Inference, BI) and other deep neural networks (such as AlexNet+SVM and the Recurrent Convolutional Neural Network, R-CNN), the proposed method shows higher effectiveness in terms of accuracy, sensitivity, positive predictive value, and F1 score.
4. **Application Scenarios**: Emotional speech signal recognition and classification can be applied in industrial product design to gauge the popularity of products by measuring consumers' emotional responses to them, thereby improving product design and speeding up market response.

In summary, the goal of this paper is to improve the accuracy of emotional speech signal recognition by proposing an improved deep learning framework, thereby providing technical support for applications in related fields.
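The four reported metrics (accuracy, sensitivity, positive predictive value, and F1 score) follow from a confusion matrix using their standard definitions. The sketch below shows one way to compute them per class. The function name `per_class_metrics` and the 3-class matrix are hypothetical and are not taken from the paper, which reports results over eight classes.

```python
import numpy as np

def per_class_metrics(cm):
    """Accuracy plus per-class sensitivity, PPV, and F1 from a confusion matrix.

    Assumes rows are true classes and columns are predicted classes, and that
    every class occurs at least once as a true and as a predicted label
    (otherwise the divisions below would hit zero).
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correctly classified samples per class
    fn = cm.sum(axis=1) - tp         # missed samples of each true class
    fp = cm.sum(axis=0) - tp         # samples wrongly assigned to each class
    sensitivity = tp / (tp + fn)     # also called recall
    ppv = tp / (tp + fp)             # positive predictive value (precision)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    accuracy = tp.sum() / cm.sum()
    return accuracy, sensitivity, ppv, f1

# Hypothetical confusion matrix for three classes, ten samples each.
toy_cm = [[8, 1, 1],
          [2, 7, 1],
          [0, 2, 8]]
accuracy, sensitivity, ppv, f1 = per_class_metrics(toy_cm)
```

Averaging the per-class arrays gives macro-averaged scores; whether the paper reports macro or another form of averaging is not stated in the abstract.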