Abstract: Speech emotion recognition is an indispensable part of natural human-computer interaction and an important part of artificial intelligence. Changes in the regulation of the speech production organs cause differences in the acoustic features of the emotional speech signal, and these differences lead listeners to perceive different emotions. Traditional speech emotion recognition methods classify emotions using only acoustic or auditory features, ignoring the important role that features directly related to speech production, such as the glottal source waveform and vocal tract shape cues, play in emotion perception. In our previous study, the contributions of glottal source and vocal tract cues to emotion perception in speech were analyzed theoretically. However, glottal source and vocal tract features have not yet been used for speech emotion recognition. Therefore, in this paper, we revisit the potential of glottal source and vocal tract cues for speech emotion recognition from the point of view of speech production. Motivated by the source-filter model of speech production, we propose a new speech emotion recognition method based on glottal source and vocal tract features. First, the glottal source and vocal tract features are estimated simultaneously from emotional speech signals using an analysis-by-synthesis approach with a source-filter model composed of an Auto-Regressive eXogenous (ARX) model and the Liljencrants-Fant (LF) model. Then, the estimated glottal source and vocal tract features are fed into a Bidirectional Gated Recurrent Unit (BiGRU) network for the speech emotion recognition task. Emotion recognition experiments were conducted on the public Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. The experimental results show that the glottal source and vocal tract features can effectively distinguish emotions and achieve higher emotion recognition accuracy than traditional emotion features. This paper focuses on the advantages of using glottal source and vocal tract features directly for speech emotion recognition, which provides new insight into speech emotion recognition technology.
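
As a rough illustration of the source-filter idea behind the ARX-based analysis described in the abstract, the sketch below drives an all-pole vocal tract filter with a glottal excitation and evaluates the ARX equation error that an analysis-by-synthesis search would try to minimize. It is a minimal sketch under stated assumptions, not the paper's estimation procedure: a Rosenberg pulse is used as a simplified stand-in for the LF model, the vocal tract is reduced to a single resonance, and all function names, the sampling rate, and the formant values are hypothetical choices made for illustration.

```python
import math


def rosenberg_pulse(period, open_quotient=0.6, speed_quotient=2.0):
    """One period of a Rosenberg glottal flow pulse (stand-in for the LF model)."""
    n_open = int(round(period * open_quotient))
    n_rise = int(round(n_open * speed_quotient / (1.0 + speed_quotient)))
    n_fall = n_open - n_rise
    pulse = []
    for n in range(period):
        if n < n_rise:        # opening phase: smooth rise from 0 to 1
            pulse.append(0.5 * (1.0 - math.cos(math.pi * n / n_rise)))
        elif n < n_open:      # closing phase: faster fall back toward 0
            pulse.append(math.cos(0.5 * math.pi * (n - n_rise) / n_fall))
        else:                 # closed phase: no airflow
            pulse.append(0.0)
    return pulse


def glottal_excitation(f0_hz, fs_hz, n_periods):
    """Glottal flow derivative (first difference of a periodic pulse train)."""
    period = int(round(fs_hz / f0_hz))
    flow = rosenberg_pulse(period) * n_periods
    return [flow[n] - flow[n - 1] if n > 0 else 0.0 for n in range(len(flow))]


def resonator_coeffs(formant_hz, bandwidth_hz, fs_hz):
    """AR coefficients [a1, a2] of a single vocal tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)
    theta = 2.0 * math.pi * formant_hz / fs_hz
    return [-2.0 * r * math.cos(theta), r * r]


def arx_synthesize(a, u, b0=1.0):
    """ARX synthesis: s(n) = -sum_k a_k * s(n-k) + b0 * u(n)."""
    s = []
    for n in range(len(u)):
        acc = b0 * u[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= a[k - 1] * s[n - k]
        s.append(acc)
    return s


def arx_residual_energy(a, s, u, b0=1.0):
    """Energy of the equation error e(n) = s(n) + sum_k a_k * s(n-k) - b0 * u(n)."""
    energy = 0.0
    for n in range(len(s)):
        e = s[n] - b0 * u[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                e += a[k - 1] * s[n - k]
        energy += e * e
    return energy


# Illustrative values only (assumptions, not taken from the paper).
fs = 8000
u = glottal_excitation(100.0, fs, 5)
a_true = resonator_coeffs(500.0, 100.0, fs)
s = arx_synthesize(a_true, u)

# A candidate filter with a mismatched formant leaves a larger residual.
a_wrong = resonator_coeffs(700.0, 100.0, fs)
err_true = arx_residual_energy(a_true, s, u)
err_wrong = arx_residual_energy(a_wrong, s, u)
print(err_true, err_wrong)
```

In this toy setting the residual energy is essentially zero for the filter that generated the signal and larger for the mismatched one, which is the quantity an analysis-by-synthesis loop would compare across candidate source and filter parameters.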
