Speech Emotion Recognition with Complementary Acoustic Representations.

Xiaoming Zhang, Fan Zhang, Xiaodong Cui, Wei Zhang
DOI: https://doi.org/10.1109/slt54892.2023.10023133
2023-01-01
Abstract: Since CNNs promote local features and Transformers capture long-range dependencies, we explore both models as encoders for acoustic representations in a parallel framework for speech emotion recognition. We choose logMels as input to the CNN encoder and MFCCs to the Transformer encoder. The complementary acoustic representations generated by the two encoders are then fused to predict the frequency distribution of emotions. To further improve the performance, we conduct data augmentation based on vocal tract length perturbation and pretrain the Transformer encoder. The proposed framework is evaluated under the speaker-independent (SI) setting on the improvisation part of the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. Our weighted and unweighted accuracies reached 81.6% and 79.8%, respectively. To the best of our knowledge, this is the state-of-the-art result reported so far on this dataset in the SI scenario.
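
The abstract reports weighted and unweighted accuracy without defining them. The sketch below assumes the convention commonly used in speech emotion recognition: weighted accuracy (WA) is the fraction of all utterances classified correctly, and unweighted accuracy (UA) is the mean of the per-class recalls, so each emotion counts equally regardless of how many utterances it has. The function names and the toy labels are illustrative only and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: every emotion class contributes equally."""
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy, made-up labels with an imbalanced class distribution (assumption).
y_true = ["neutral"] * 6 + ["happy"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["happy", "neutral"] + ["sad", "neutral"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On imbalanced data such as this toy example, WA is dominated by the majority class while UA penalizes poor recall on the minority classes, which is why both figures are usually reported together.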