SYLLABLE-DEPENDENT DISCRIMINATIVE LEARNING FOR SMALL FOOTPRINT TEXT-DEPENDENT SPEAKER VERIFICATION

Junyi Peng, Yuexian Zou, Na Li, Deyi Tuo, Dan Su, Meng Yu, Chunlei Zhang, Dong Yu
DOI: https://doi.org/10.1109/asru46091.2019.9004023
2019-01-01
Abstract: This study proposes a novel scheme of syllable-dependent discriminative speaker embedding learning for small footprint text-dependent speaker verification systems. To suppress undesired syllable variation and enhance the discriminative power inherent in the frame-level features, we design a novel syllable-dependent clustering loss to optimize the network. Specifically, this loss function uses syllable labels as auxiliary supervision to explicitly maximize inter-syllable separability and intra-syllable compactness among the learned frame-level features. Subsequently, we propose two syllable-dependent pooling mechanisms that aggregate the frame-level features into several syllable-level features by averaging the features corresponding to each syllable. Highly discriminative utterance-level speaker embeddings are then obtained by concatenating the syllable-level features. Experimental results on the Tencent voice wake-up dataset show that the proposed scheme accelerates network convergence and achieves significant performance improvements over state-of-the-art methods.
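The pooling step described in the abstract (average the frame-level features belonging to each syllable, then concatenate the syllable-level means) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the two pooling mechanisms, so the sketch assumes a hard frame-to-syllable alignment is already available, and the names `syllable_pool`, `frames`, `syllable_ids`, and `num_syllables` are hypothetical.

```python
from typing import List, Sequence


def syllable_pool(frames: Sequence[Sequence[float]],
                  syllable_ids: Sequence[int],
                  num_syllables: int) -> List[float]:
    """Average frame features per syllable, then concatenate the means."""
    if not frames or len(frames) != len(syllable_ids):
        raise ValueError("need a non-empty input with one syllable label per frame")
    dim = len(frames[0])
    sums = [[0.0] * dim for _ in range(num_syllables)]
    counts = [0] * num_syllables
    # Accumulate per-syllable sums and frame counts.
    for feat, syl in zip(frames, syllable_ids):
        counts[syl] += 1
        for d, value in enumerate(feat):
            sums[syl][d] += value
    embedding: List[float] = []
    for syl in range(num_syllables):
        # Assumption: a syllable with no aligned frames contributes zeros.
        n = max(counts[syl], 1)
        embedding.extend(s / n for s in sums[syl])
    return embedding


# Toy input: 5 frames of 2-dim features over a 2-syllable phrase.
frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [20.0, 0.0], [30.0, 0.0]]
syllable_ids = [0, 0, 1, 1, 1]
embedding = syllable_pool(frames, syllable_ids, num_syllables=2)
print(embedding)  # [2.0, 3.0, 20.0, 0.0]
```

Because the pass phrase is fixed in a text-dependent system, the number of syllables is known in advance, so the concatenated embedding has a fixed length of `num_syllables * dim` regardless of utterance duration.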