Abstract: Speech Emotion Recognition (SER) affective technology enables the intelligent embedded devices to interact with sensitivity. Similarly, call centre employees recognise customers' emotions from their pitch, energy, and tone of voice so as to modify their speech for a high-quality interaction with customers. This work explores, for the first time, the effects of the harmonic and percussive components of Mel spectrograms in SER. We attempt to leverage the Mel spectrogram by decomposing distinguishable acoustic features for exploitation in our proposed architecture, which includes a novel feature map generator algorithm, a CNN-based network feature extractor and a multi-layer perceptron (MLP) classifier. This study specifically focuses on effective data augmentation techniques for building an enriched hybrid-based feature map. This process results in a function that outputs a 2D image so that it can be used as input data for a pre-trained CNN-VGG16 feature extractor. Furthermore, we also investigate other acoustic features such as MFCCs, chromagram, spectral contrast, and the tonnetz to assess our proposed framework. A test accuracy of 92.79% on the Berlin EMO-DB database is achieved. Our result is higher than previous works using CNN-VGG16.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper explores how to make fuller use of the harmonic and percussive components of the Mel spectrogram in Speech Emotion Recognition (SER) in order to improve recognition accuracy.

#### Main Research Questions

- **How can Mel spectrogram features be exploited more fully to improve speech emotion recognition?** The paper proposes a method that processes the harmonic and percussive components of the Mel spectrogram and combines them with log-Mel spectrogram features to construct a hybrid acoustic feature map. The aim is to extract more distinctive and robust features and so improve the accuracy of emotion recognition.

#### Research Contributions

- Proposes a hybrid acoustic feature map that combines the harmonic and percussive components with log-Mel spectrogram features into a new feature representation.
- Uses a pre-trained CNN-VGG16 network as the feature extractor and a Multi-Layer Perceptron (MLP) as the classifier.
- Tunes the parameters of the MLP network to improve model performance.
- Reports a test accuracy of 92.79% on the Berlin EMO-DB database, higher than previous methods using CNN-VGG16.

#### Method Overview

- **Feature Extraction**: The Mel spectrogram is decomposed into harmonic and percussive components, which are combined with log-Mel spectrogram features to generate a hybrid feature map.
- **Model Architecture**: CNN-VGG16 is used as the feature extractor, and an MLP is used for classification.
- **Data Augmentation**: The model's generalization ability is improved through data augmentation strategies that combine prosodic and acoustic features.

#### Experimental Analysis

- Experiments were conducted on the Berlin EMO-DB database. The results show that the proposed hybrid feature map significantly outperforms other common feature combination techniques.
- The model performs better at higher sampling rates and window sizes, reaching its best test accuracy of 92.79% with 128x128 feature maps and a sampling rate of 88200.

In summary, the paper investigates how the harmonic and percussive components of the Mel spectrogram can improve speech emotion recognition, and it proposes a hybrid feature representation whose effectiveness is demonstrated on the Berlin EMO-DB database.
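The feature-extraction step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' feature map generator: it uses the common median-filtering approach to harmonic/percussive separation with soft masks, and it assumes the hybrid map is formed by stacking the log-scaled Mel, harmonic, and percussive spectrograms as three image channels. The function names, the kernel size of 17, and the channel layout are all hypothetical. In practice a library routine such as `librosa.decompose.hpss` would normally replace the hand-written filter.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter_1d(spec, kernel, axis):
    """Median-filter a 2D array along one axis, using reflect padding."""
    pad = kernel // 2
    pad_width = [(0, 0), (0, 0)]
    pad_width[axis] = (pad, pad)
    padded = np.pad(spec, pad_width, mode="reflect")
    windows = sliding_window_view(padded, kernel, axis=axis)
    return np.median(windows, axis=-1)


def hpss(mel, kernel=17, eps=1e-10):
    """Split a non-negative (n_mels, n_frames) spectrogram with soft masks.

    Assumption: median-filtering HPSS, not necessarily the paper's variant.
    """
    harm = median_filter_1d(mel, kernel, axis=1)  # smooth over time
    perc = median_filter_1d(mel, kernel, axis=0)  # smooth over frequency
    total = harm**2 + perc**2 + eps
    return mel * (harm**2 / total), mel * (perc**2 / total)


def to_log_image(spec, eps=1e-10):
    """Convert to decibels, then min-max scale into [0, 1]."""
    log_spec = 10.0 * np.log10(np.maximum(spec, eps))
    lo, hi = log_spec.min(), log_spec.max()
    return (log_spec - lo) / (hi - lo + eps)


def hybrid_feature_map(mel):
    """Stack log-Mel, harmonic, and percussive maps as three channels.

    Assumption: the channel stacking is illustrative; the paper's exact
    combination rule is not given in the summary above.
    """
    harmonic, percussive = hpss(mel)
    channels = [to_log_image(s) for s in (mel, harmonic, percussive)]
    return np.stack(channels, axis=-1).astype(np.float32)


# Synthetic stand-in for a real 128x128 Mel power spectrogram.
rng = np.random.default_rng(0)
mel = rng.random((128, 128)) + 0.1
fmap = hybrid_feature_map(mel)
print(fmap.shape)  # (128, 128, 3)
```

The three-channel output has the same layout as an RGB image, which is what an ImageNet-pre-trained VGG16 expects as input. A real pipeline would compute `mel` from audio and resize or crop the map to the network's input size before feature extraction.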

Leveraged Mel spectrograms using Harmonic and Percussive Components in Speech Emotion Recognition

Deep Spectrum Feature Representations for Speech Emotion Recognition

Speech emotion recognition based on optimized deep features of dual-channel complementary spectrogram

Self-attention Transfer Networks for Speech Emotion Recognition

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers

Speech Emotion Recognition Using Mel-Frequency Cepstral Coefficients & Convolutional Neural Networks

A Residual Multi-Scale Convolutional Transformer Network with Chunk-level Log-Mel Spectrograms for Speech Emotion Recognition

Speech Emotion Recognition Based on Formant Characteristics Feature Extraction and Phoneme Type Convergence.

A Hybrid Time-Distributed Deep Neural Architecture for Speech Emotion Recognition

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network

Speech Emotion Recognition Using RA-gMLP Model on Time–Frequency Domain Features Extracted by TFCM

Cross-Corpus Speech Emotion Recognition Based on Hybrid Neural Networks

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Paralinguistic and spectral feature extraction for speech emotion classification using machine learning techniques

Syllable Level Speech Emotion Recognition Based on Formant Attention

Teager_mel and PLP Fusion Feature Based Speech Emotion Recognition

Speech Emotion Recognition Using Mel Frequency Log Spectrogram and Deep Convolutional Neural Network