Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System.

Zhifu Gao, Yan Song, Ian McLoughlin, Pengcheng Li, Yiheng Jiang, Lirong Dai
DOI: https://doi.org/10.21437/interspeech.2019-1489
2019-01-01
Abstract: Deep embedding learning based speaker verification (SV) methods have recently achieved significant performance improvements over traditional i-vector systems, especially for short-duration utterances. Embedding learning commonly consists of three components: frame-level feature processing, utterance-level embedding learning, and a loss function to discriminate between speakers. For the learned embeddings, a back-end model (i.e., Linear Discriminant Analysis followed by Probabilistic Linear Discriminant Analysis (LDA-PLDA)) is generally applied as a similarity measure. In this paper, we propose to further improve the effectiveness of deep embedding learning methods in two of these components: (1) a multi-stage aggregation strategy that hierarchically fuses time-frequency context information for effective frame-level feature processing, and (2) a discriminant analysis loss, designed for end-to-end training, that explicitly learns discriminative embeddings, i.e., embeddings with small intra-speaker and large inter-speaker variances. To evaluate the effectiveness of the proposed improvements, we conduct extensive experiments on the VoxCeleb1 dataset. The results outperform state-of-the-art systems by a significant margin. It is also worth noting that the results are obtained using a simple cosine metric instead of the more complex LDA-PLDA back-end scoring.
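The cosine scoring mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `cosine_score` and `same_speaker`, the toy two-dimensional embeddings, and the threshold value of 0.5 are all assumptions made for illustration. In practice a decision threshold would be tuned on development data.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def same_speaker(enroll, test, threshold=0.5):
    """Accept the trial if the cosine score reaches the threshold.

    The default threshold is a hypothetical placeholder, not a value
    taken from the paper.
    """
    return cosine_score(enroll, test) >= threshold
```

Because the score depends only on the angle between the two embeddings, it needs no trained back-end parameters, which is what makes it simpler than LDA-PLDA scoring.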