Bridging Mixture Density Networks with Meta-Learning for Automatic Speaker Identification

Ruirui Li, Jyun-Yu Jiang, Xian Wu, Hongda Mao, Chu-Cheng Hsieh, Wei Wang
DOI: https://doi.org/10.1109/icassp40776.2020.9054111
2020-01-01
Abstract: Speaker identification answers the fundamental question "Who is speaking?" The identification technology enables various downstream applications to provide a personalized experience. Both the prevalent i-vector based solutions and the state-of-the-art deep learning solutions usually treat all users equally during training, making no distinction between new and existing users. We observe that many new users start with limited labeled training data, which often results in inferior performance in recognizing their voices. To alleviate the disadvantage caused by training data deficiency, we propose a Mixture Density Network-based Meta-Learning method (MDNML) for speaker identification. MDNML focuses on quickly learning to recognize new users, each of whom has only a few seconds of labeled data. We conduct experiments on the LibriSpeech dataset and compare MDNML with four state-of-the-art baseline methods. The results show that MDNML achieves higher accuracy than all baseline methods in recognizing new users with limited labeled utterances. Our proposed solution significantly expedites learning by transferring knowledge learned from the existing user base through gradient-based meta-learning. We consider our work to be a stepping stone toward more sophisticated meta-learning frameworks for accelerating voice recognition. Furthermore, we discuss a strategy for enhancing accuracy by incorporating the notion of household-based acoustic profiles into MDNML.