Multi-level Feature Joint Learning Methods for Emotional Speaker Recognition

Zhongliang Zeng, Dongdong Li, Zhe Wang, Hai Yang
DOI: https://doi.org/10.1109/IJCNN54540.2023.10191744
2023-01-01
Abstract: In real-world scenarios, changes in speaker features caused by different emotional states have a great impact on speaker recognition performance. To improve the robustness of speaker recognition systems, existing emotional speaker recognition technologies tend to cascade different models, ignoring the frame-level acoustic features and the segment-level discourse habit features. To this end, we combine frame- and segment-level features in different ways to build a robust recognition system for emotional speakers. The frame-level and segment-level features are jointly learned to retain both emotional information and speaker information. Four joint learning methods, namely Joint in Series, Joint in Parallel, Joint under the Guidance, and Joint with Original Feature, are discussed to explore the correlations between segment-level and frame-level features. The experimental results show that speaker features change greatly across emotional states. Compared with the 90.95% accuracy achieved by x-vector, the proposed Joint in Parallel and Joint with Original Feature methods achieve accuracies of 95.06% and 94.67%, respectively, for emotional speaker recognition on the Mandarin Affective Speech Corpus (MASC). Our findings provide a novel perspective on improving the robustness of speaker recognition.