Abstract: Speech emotion recognition (SER) is a crucial topic in human–computer interaction. However, extracting emotional embeddings remains challenging: the embeddings extracted by network models often contain noise and incomplete emotional information. To meet these challenges, this study developed a model (MVIB-DVA) composed of a multi-feature variational information bottleneck (MVIB), based on the information bottleneck (IB) principle, and a dual-view aware module (DVAM) with an attention mechanism. MVIB is driven by the IB principle and uses learned minimal sufficient single-feature emotional embeddings as auxiliary information. The aims are to capture the unique emotional information in individual features and the complementary information between different types of features, to reduce noise, and to represent rich emotional information with fewer parameters. DVAM comprises (1) a frequency-domain statistical aware module (FDSAM) that emphasizes the frequencies that best reflect emotional information and (2) a frame aware module (FAM) in the time domain that focuses on the frames that contribute the most to the final emotion recognition result. A separate channel extracts details ignored in the frequency-domain and time-domain views, yielding more comprehensive emotional information. In experiments, MVIB-DVA achieved a weighted accuracy (WA) of 74.05% and an unweighted accuracy (UA) of 75.67% on the IEMOCAP dataset, and a WA of 86.66% and a UA of 86.51% on the RAVDESS dataset.
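The abstract reports WA and UA without defining them. As a minimal sketch, assuming the convention common in SER work (WA is overall accuracy across all utterances; UA is the mean of per-class recalls, so each emotion class counts equally regardless of size), the two metrics can be computed as follows. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class contributes equally."""
    total = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)   # correctly recognized utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / total[c] for c in total) / len(total)


# Toy, made-up labels (illustration only, not data from the paper).
y_true = ["angry", "angry", "angry", "angry", "happy", "sad"]
y_pred = ["angry", "angry", "angry", "sad", "happy", "happy"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

On class-imbalanced data the two values can diverge noticeably, which is presumably why both are reported.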

Using I-Vector Space Model For Emotion Recognition

Emotional speaker recognition based on i-vector through Atom Aligned Sparse Representation

Applying Emotional Factor Analysis And I-Vector To Emotional Speaker Recognition

Speech Emotion Recognition With I-Vector Feature And RNN Model

Emotional Speaker Recognition Based on Model Space Migration through Translated Learning

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

DBN-ivector Framework for Acoustic Emotion Recognition

Scores Selection for Emotional Speaker Recognition

Deep Spectrum Feature Representations for Speech Emotion Recognition

A Preliminary Study on GMM Weight Transformation for Emotional Speaker Recognition

Speech Emotion Classification Using Acoustic Features

Speaker-Independent Speech Emotion Recognition Based On CNN-BLSTM And Multiple SVMs

Relative entropy normalized Gaussian supervector for speech emotion recognition using kernel extreme learning machine

GMM Supervector Based SVM with Spectral Features for Speech Emotion Recognition

Speech Emotion Recognition With Acoustic And Lexical Features

Speech emotion recognition using combination of features

Emotion Invariant Speaker Embeddings for Speaker Identification with Emotional Speech

Speech Emotion Recognition Based on SVM and GMM-HMM Hybrid System

MVIB-DVA: Learning minimum sufficient multi-feature speech emotion embeddings under dual-view aware

I-Vector Based Speaker Gender Recognition

The Role of Phonetic Units in Speech Emotion Recognition