DNN-based Discriminative Scoring for Speaker Recognition Based on i-vector
Jun Wang, Dong Wang, Thomas Fang Zheng, Fanhu Bie
2015-01-01
Correspondence: fzheng@tsinghua.edu.cn, Center for Speech and Language Technologies, Tsinghua University, Room 4-416, Information Sci & Tech Building, Tsinghua University, 100084 Beijing, China
Abstract: One of the state-of-the-art approaches to speaker recognition is based on factor analysis, especially the i-vector model. By representing a speech segment as a vector in a low-dimensional space, the i-vector model can deal with the complex correlation among the components of the Gaussian mixture model (GMM). On the other hand, it is well known that i-vectors contain both speaker and session variances, and therefore additional discriminative approaches are required to emphasize the speaker-dependent information in the ‘total variance’ space. Among various methods, probabilistic linear discriminant analysis (PLDA) achieves strong performance, partly due to its generative framework, which represents the speaker and session variances in a hierarchical way. A disadvantage of PLDA, however, lies in its Gaussian assumptions about the speaker and session variables, which do not necessarily hold in practice. This paper presents a discriminative scoring approach for i-vector-based speaker recognition based on deep neural networks (DNN). This approach casts the recognition task as a binary classification problem and employs the DNN model to learn the complex decision boundary in the heterogeneous speaker space. Compared with the PLDA-based approach, the new approach does not rely on any artificial assumption about the distribution of the data, and can optimize the model directly with respect to the recognition task. Our experiments on the NIST SRE08 core test demonstrate that the DNN-based approach outperforms the PLDA-based approach, and that combining the DNN and PLDA scores leads to further gains.
Finally, we compare the DNN model with a discriminative but shallow model, the support vector machine (SVM), and find that the DNN clearly outperforms the SVM, confirming the advantage of deep learning.
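To make the binary-classification view concrete, the following is a minimal sketch in Python (NumPy only), not the authors' implementation. It assumes, purely for illustration, that a trial is represented by concatenating the two length-normalized i-vectors, that the network is a small feed-forward model with ReLU hidden layers and a sigmoid output, and that DNN and PLDA scores are combined by a weighted sum after z-normalization. None of these choices is specified in the abstract; the function names, layer sizes, and random data are hypothetical.

```python
import numpy as np


def length_normalize(x):
    """Scale each row (one i-vector per row) to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def make_trials(ivectors, speaker_ids):
    """Build all unordered i-vector pairs with a binary label.

    Label 1 = same speaker (target trial), 0 = different speakers.
    Concatenating the pair is an assumption made for this sketch.
    """
    feats, labels = [], []
    n = len(ivectors)
    for i in range(n):
        for j in range(i + 1, n):
            feats.append(np.concatenate([ivectors[i], ivectors[j]]))
            labels.append(1 if speaker_ids[i] == speaker_ids[j] else 0)
    return np.stack(feats), np.array(labels)


def mlp_score(feats, params):
    """Forward pass of a feed-forward net; returns P(same speaker) per trial."""
    h = feats
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)  # ReLU hidden layer (assumed)
    W, b = params[-1]
    logit = (h @ W + b).ravel()
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid output


def z_norm(scores):
    """Shift and scale a score vector to zero mean and unit variance."""
    return (scores - scores.mean()) / scores.std()


def fuse_scores(dnn_scores, plda_scores, alpha=0.5):
    """Weighted sum of z-normalized scores (hypothetical fusion rule)."""
    return alpha * z_norm(dnn_scores) + (1.0 - alpha) * z_norm(plda_scores)


# Toy data: 4 utterances from 2 speakers, 4-dimensional "i-vectors".
rng = np.random.default_rng(0)
ivecs = length_normalize(rng.normal(size=(4, 4)))
feats, labels = make_trials(ivecs, [0, 0, 1, 1])

# Untrained network with random weights, only to show the data flow.
params = [
    (0.1 * rng.normal(size=(8, 16)), np.zeros(16)),
    (0.1 * rng.normal(size=(16, 1)), np.zeros(1)),
]
dnn_scores = mlp_score(feats, params)
plda_scores = rng.normal(size=len(labels))  # stand-in for real PLDA scores
fused = fuse_scores(dnn_scores, plda_scores, alpha=0.5)
```

In a real system the weights would be trained on the labeled trials (for example with a cross-entropy objective), and the PLDA scores would come from an actual PLDA back-end rather than random numbers.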