Leveraged Mel spectrograms using Harmonic and Percussive Components in Speech Emotion Recognition

David Hason Rudd, Huan Huo, Guandong Xu
DOI: https://doi.org/10.1007/978-3-031-05936-0_31
2023-12-18
Abstract: Speech Emotion Recognition (SER) affective technology enables intelligent embedded devices to interact with sensitivity. Similarly, call centre employees recognise customers' emotions from their pitch, energy, and tone of voice so as to modify their speech for a high-quality interaction with customers. This work explores, for the first time, the effects of the harmonic and percussive components of Mel spectrograms in SER. We attempt to leverage the Mel spectrogram by decomposing distinguishable acoustic features for exploitation in our proposed architecture, which includes a novel feature map generator algorithm, a CNN-based network feature extractor and a multi-layer perceptron (MLP) classifier. This study specifically focuses on effective data augmentation techniques for building an enriched hybrid-based feature map. This process results in a function that outputs a 2D image so that it can be used as input data for a pre-trained CNN-VGG16 feature extractor. Furthermore, we also investigate other acoustic features such as MFCCs, chromagram, spectral contrast, and the tonnetz to assess our proposed framework. A test accuracy of 92.79% on the Berlin EMO-DB database is achieved. Our result is higher than previous works using CNN-VGG16.
Sound, Computer Vision and Pattern Recognition, Human-Computer Interaction, Machine Learning, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper primarily explores how to maximize the use of harmonic and percussive components of the Mel spectrogram in the task of Speech Emotion Recognition (SER) to improve recognition accuracy.

#### Main Research Questions

- **How to maximize the use of Mel spectrogram features to improve speech emotion recognition?** Specifically, the paper proposes a new method that processes the harmonic and percussive components of the Mel spectrogram and combines them with log-Mel spectrogram features to construct a hybrid acoustic feature map. This method aims to extract more distinctive and robust features to enhance the accuracy of emotion recognition.

#### Research Contributions

- Proposed an effective hybrid acoustic feature map technique by combining harmonic and percussive components with log-Mel spectrogram features to construct a new feature representation method.
- Used a pre-trained CNN-VGG16 network as a feature extractor and employed a Multi-Layer Perceptron (MLP) for the classification task.
- Tuned the parameters of the MLP network to achieve optimal model performance.
- Experimental results show that this method achieved a test accuracy of 92.79% on the Berlin EMO-DB database, outperforming previous methods using CNN-VGG16.

#### Method Overview

- **Feature Extraction**: By decomposing the harmonic and percussive components of the Mel spectrogram and combining them with log-Mel spectrogram features, a hybrid feature map is generated.
- **Model Architecture**: CNN-VGG16 is used as the feature extractor, and MLP is used for the classification task.
- **Data Augmentation**: The model's generalization ability is improved through effective data augmentation strategies that combine prosodic and acoustic features.
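The feature extraction step above can be illustrated with a minimal sketch of harmonic-percussive separation by median filtering, a common way to perform this decomposition. The source does not say which implementation or parameters the authors used, so everything here is an assumption for illustration: the function names (`median_filter_1d`, `hpss`, `log_scale`, `hybrid_feature_map`), the kernel size, the soft-mask formula, and in particular the choice to stack the log-scaled Mel, harmonic, and percussive parts as three channels.

```python
import math
from statistics import median


def median_filter_1d(values, kernel):
    """Median-filter a list, truncating the window at the edges."""
    half = kernel // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def hpss(spec, kernel=3, eps=1e-10):
    """Split a non-negative spectrogram (rows = Mel bands, columns = frames)
    into harmonic and percussive parts using soft masks."""
    n_bands, n_frames = len(spec), len(spec[0])
    # Harmonic content is smooth along time: filter each row.
    harm_ref = [median_filter_1d(row, kernel) for row in spec]
    # Percussive content is smooth along frequency: filter each column.
    cols = [median_filter_1d([spec[b][t] for b in range(n_bands)], kernel)
            for t in range(n_frames)]
    harmonic, percussive = [], []
    for b in range(n_bands):
        h_row, p_row = [], []
        for t in range(n_frames):
            h2, p2 = harm_ref[b][t] ** 2, cols[t][b] ** 2
            mask_h = (h2 + eps / 2) / (h2 + p2 + eps)  # soft mask in [0, 1]
            h_row.append(spec[b][t] * mask_h)
            p_row.append(spec[b][t] * (1.0 - mask_h))
        harmonic.append(h_row)
        percussive.append(p_row)
    return harmonic, percussive


def log_scale(spec, eps=1e-10):
    """Convert to decibels, flooring values at eps to avoid log(0)."""
    return [[10.0 * math.log10(max(v, eps)) for v in row] for row in spec]


def hybrid_feature_map(mel_spec):
    """ASSUMPTION: stack log-Mel, log-harmonic and log-percussive as channels."""
    harmonic, percussive = hpss(mel_spec)
    return [log_scale(mel_spec), log_scale(harmonic), log_scale(percussive)]


# Toy spectrogram: one sustained tone (band 2) and one click (frame 3).
mel = [[1.0 if b == 2 or t == 3 else 0.0 for t in range(7)] for b in range(5)]
harmonic, percussive = hpss(mel)
fmap = hybrid_feature_map(mel)
```

On the toy input, the sustained band ends up almost entirely in `harmonic` and the click almost entirely in `percussive`, and the two parts always sum back to the input. For real audio, librosa provides `librosa.decompose.hpss`, which is a more practical choice than this pure-Python loop.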
#### Experimental Analysis

- Experiments were conducted on the Berlin EMO-DB database, and the results show that the proposed hybrid feature map method significantly outperforms other common feature combination techniques.
- The model performs better at higher sampling rates and larger window sizes, reaching a test accuracy of 92.79% at 128x128 dimensions and a sampling rate of 88,200 Hz.

In summary, this paper aims to explore how to improve speech emotion recognition by utilizing the harmonic and percussive components of the Mel spectrogram and proposes a new hybrid feature representation method, demonstrating its effectiveness in practical applications.
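The 128x128 dimensions reported above suggest that each feature map is rescaled to a fixed-size image before it reaches the CNN, although the source does not describe how. The sketch below shows one plausible way to do it. The function name `to_image`, the min-max scaling to 0..255, and the nearest-neighbour resizing are all assumptions for illustration, not the authors' procedure.

```python
def to_image(feature, size=128):
    """Min-max scale a 2D feature (rows x cols) to 0..255 and resize it
    to size x size with nearest-neighbour sampling (assumed scheme)."""
    lo = min(min(row) for row in feature)
    hi = max(max(row) for row in feature)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant input
    rows, cols = len(feature), len(feature[0])
    image = []
    for y in range(size):
        src_row = feature[y * rows // size]  # nearest source row
        image.append([
            round(255 * (src_row[x * cols // size] - lo) / span)
            for x in range(size)
        ])
    return image


# Toy 2 x 4 feature whose values grow from 0.0 (top-left) to 4.0 (bottom-right).
toy = [[float(r + c) for c in range(4)] for r in range(2)]
img = to_image(toy)
```

The result is a 128x128 grid of integer pixel values. A real pipeline would more likely use an image library with interpolation, and a three-channel input would apply the same step to each channel.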