Abstract: Finding an effective way to fuse multiple speech features for cross-corpus speech emotion recognition is crucial, as different speech features can provide complementary cues about a speaker's emotional state. Most previous approaches extract only a single speech feature for emotion recognition, and existing fusion methods such as concatenation, parallel connection, and splicing ignore the heterogeneous patterns in the interactions between features, which limits the performance of existing systems. In this paper, we propose a novel graph-based fusion method that explicitly models the relationship between every pair of speech features. Specifically, we propose a multi-dimensional edge feature learning strategy called the Graph-based multi-Feature fusion method for speech emotion recognition. It represents each speech feature as a node and learns multi-dimensional edge features that explicitly describe the relationship between each feature-feature pair in the context of emotion recognition. In this way, the learned multi-dimensional edge features encode speech feature-level information from both the vertex and edge dimensions. Our approach consists of three modules: an Audio Feature Generation (AFG) module, an Audio-Feature Multi-dimensional Edge Feature (AMEF) module, and a Speech Emotion Recognition (SER) module. The proposed method achieves satisfactory results on the SEWA dataset and outperforms the baseline of the AVEC 2019 Workshop and Challenge. We used data from two cultures in the SEWA dataset, German and Hungarian, as our training and validation sets; the concordance correlation coefficient (CCC) scores for German improve by 17.28% for arousal and 7.93% for liking. Our method also shows a 13% improvement over alternative fusion techniques, including those that use a one-dimensional edge-based feature fusion approach.
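To make the contrast between one-dimensional and multi-dimensional edges concrete, the following minimal sketch treats each speech feature as a node and attaches either a single scalar or a vector to each ordered node pair. It is an illustration under stated assumptions, not the AMEF module itself: the abstract says the edge features are learned, whereas here a fixed element-wise product stands in for the learned mapping, and the node names, embedding values, and function names are all hypothetical.

```python
from itertools import permutations
from math import sqrt


def scalar_edge(u, v):
    """One-dimensional edge: a single cosine-similarity value per node pair."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def multi_dim_edge(u, v):
    """Multi-dimensional edge: one value per embedding dimension.

    Assumption: an element-wise product is used here as a fixed stand-in
    for the learned edge features described in the abstract.
    """
    return [a * b for a, b in zip(u, v)]


def build_graph(nodes):
    """Return a dict mapping each ordered node pair to its edge vector."""
    return {
        (src, dst): multi_dim_edge(nodes[src], nodes[dst])
        for src, dst in permutations(nodes, 2)
    }


# Hypothetical node embeddings, one per speech feature (values are made up).
nodes = {
    "feature_a": [0.2, -0.5, 0.9, 0.1],
    "feature_b": [0.7, 0.3, -0.4, 0.6],
    "feature_c": [-0.1, 0.8, 0.5, -0.3],
}

edges = build_graph(nodes)
for pair, edge in edges.items():
    print(pair, [round(x, 3) for x in edge], round(scalar_edge(*map(nodes.get, pair)), 3))
```

A scalar edge collapses the relationship between two features into one number, while the vector edge keeps a separate value per dimension, which is the kind of extra capacity the multi-dimensional edge strategy relies on.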

Fusion Of Global Statistical And Segmental Spectral Features For Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Deep Spectrum Feature Representations for Speech Emotion Recognition

Speech Emotion Recognition Using Both Spectral and Prosodic Features

An autoencoder-based feature level fusion for speech emotion recognition

Graph-based multi-Feature fusion method for speech emotion recognition

Ann Based Decision Fusion for Speech Emotion Recognition

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Spectrogram feature extraction algorithm for speech emotion recognition

Speech Emotion Recognition Based on Multi-feature and Multi-lingual Fusion

Spectral Features Based on Local Hu Moments of Gabor Spectrograms for Speech Emotion Recognition

Speech emotion recognition using combination of features

A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition

A Hybrid Speech Emotion Recognition System Based On Spectral And Prosodic Features

Information Fusion in Attention Networks Using Adaptive and Multi-level Factorized Bilinear Pooling for Audio-visual Emotion Recognition

Speech Emotion Recognition Based on Feature Selection and Extreme Learning Machine Decision Tree

Study on Method of Emotion Recognition of Speech Based on Feature Parameter Fusion

Multi-modal Emotion Recognition Based on Speech and Image

Improved emotion recognition with novel global utterance-level features

GCF2-Net: global-aware cross-modal feature fusion network for speech emotion recognition