Correspondence: fzheng@tsinghua.edu.cn
Center for Speech and Language Technologies, Tsinghua University, ROOM 4-416, Information Sci & Tech Building, Tsinghua University, 100084 Beijing, China

Abstract: One of the state-of-the-art approaches to speaker recognition is based on factor analysis, especially the i-vector model. By representing a speech segment as a vector in a low-dimensional vector space, the i-vector model can deal with the complex correlation among components of the Gaussian mixture model (GMM). On the other hand, it is well known that i-vectors contain both speaker and session variances, and therefore additional discriminative approaches are required to emphasize the speaker-dependent information in the 'total variance' space. Among various methods, probabilistic linear discriminant analysis (PLDA) achieves significant performance, partly due to its generative model framework, which represents the speaker and session variances in a hierarchical way. A disadvantage of PLDA, however, lies in its Gaussian assumptions about the speaker and session variables, which do not necessarily hold in most situations. This paper presents a discriminative scoring approach for i-vector-based speaker recognition based on deep neural networks (DNN). This approach casts the recognition task as a binary classification problem and employs the DNN model to learn the complex decision boundary in the heterogeneous speaker space. Compared with the PLDA-based approach, the new approach does not rely on any artificial assumption about the distribution of the data, and can optimize the model directly with respect to the recognition task. Our experiments on the NIST SRE08 core test demonstrate that the DNN-based approach outperforms the PLDA-based approach, and that combining the DNN and PLDA scores leads to further gains. Finally, we compare the DNN model with a discriminative but shallow model, the support vector machine (SVM), and find that the DNN clearly outperforms the SVM, confirming the advantage of deep learning.
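The idea of casting speaker verification as binary classification can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how a trial is presented to the network, so the element-wise product used as the pair representation, the single 16-unit hidden layer, the learning rate, and the synthetic "i-vectors" (a per-speaker mean plus session noise) are all assumptions made here for illustration. Each pair of vectors becomes one trial, labeled 1 if both come from the same speaker and 0 otherwise, and a small network is trained to output a score for the trial.

```python
import itertools
import math
import random

def synthetic_ivectors(n_speakers, n_sessions, dim, rng):
    """Toy stand-in for i-vectors: speaker mean plus session noise (assumption)."""
    data = []
    for spk in range(n_speakers):
        mean = [rng.gauss(0, 1) for _ in range(dim)]
        for _ in range(n_sessions):
            data.append((spk, [m + rng.gauss(0, 0.3) for m in mean]))
    return data

def make_trials(ivectors):
    """Turn labeled vectors into binary trials: 1 = same speaker, 0 = different."""
    trials = []
    for (spk_a, a), (spk_b, b) in itertools.combinations(ivectors, 2):
        # Hypothetical pair representation: element-wise product of the two vectors.
        feats = [x * y for x, y in zip(a, b)]
        trials.append((feats, 1 if spk_a == spk_b else 0))
    return trials

class TinyMLP:
    """One tanh hidden layer and a sigmoid output, trained with plain SGD."""

    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.gauss(0, 0.1) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2
        # Sigmoid written via tanh so that large |z| cannot overflow.
        return h, 0.5 * (1.0 + math.tanh(0.5 * z))

    def score(self, x):
        """Score in [0, 1]; higher means 'same speaker' is more likely."""
        return self.forward(x)[1]

    def sgd_step(self, x, y, lr):
        h, p = self.forward(x)
        dz = p - y  # gradient of the cross-entropy loss w.r.t. the output logit
        for j, hj in enumerate(h):
            dh = dz * self.w2[j] * (1.0 - hj * hj)  # backprop through tanh
            self.w2[j] -= lr * dz * hj
            self.b1[j] -= lr * dh
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dh * xi
        self.b2 -= lr * dz

rng = random.Random(0)
trials = make_trials(synthetic_ivectors(n_speakers=6, n_sessions=4, dim=8, rng=rng))
model = TinyMLP(n_in=8, n_hidden=16, rng=rng)
for epoch in range(50):
    rng.shuffle(trials)
    for feats, label in trials:
        model.sgd_step(feats, label, lr=0.05)

target = [model.score(f) for f, y in trials if y == 1]
nontarget = [model.score(f) for f, y in trials if y == 0]
print("mean target score:", sum(target) / len(target))
print("mean non-target score:", sum(nontarget) / len(nontarget))
```

Unlike a PLDA scorer, nothing in this sketch assumes a Gaussian form for the speaker or session variables: the decision boundary is whatever the network learns from the labeled trials. A real system would score held-out trials rather than the training pairs used here.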

Structured Discriminative Models Using Deep Neural-Network Features

Towards Structured Deep Neural Network for Automatic Speech Recognition

State-Clustering Based Multiple Deep Neural Networks Modeling Approach for Speech Recognition

Deep neural networks based speaker modeling at different levels of phonetic granularity

An Experimental Study on Speech Enhancement Based on Deep Neural Networks

Collaborative Deep Learning for Speech Enhancement: A Run-Time Model Selection Method Using Autoencoders

Deep Discriminative Feature Learning for Accent Recognition

An Acoustic Model for English Speech Recognition Based on Deep Learning

Acceleration Strategies for Speech Recognition Based on Deep Neural Networks

Performance Evaluation of Deep Neural Networks Applied to Speech Recognition: RNN, LSTM and GRU

A hybrid discriminant fuzzy DNN with enhanced modularity bat algorithm for speech recognition

Deep causal speech enhancement and recognition using efficient long-short term memory Recurrent Neural Network

Deep neural network architectures for dysarthric speech analysis and recognition

DLD: An Optimized Chinese Speech Recognition Model Based on Deep Learning

An Efficient and Interpretable Speech Enhancement Network Via Deep Dictionary Learning

Fast Adaptation of Deep Neural Network Based on Discriminant Codes for Speech Recognition

DNN-based Discriminative Scoring for Speaker Recognition Based on i-vector

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

Building DNN acoustic models for large vocabulary speech recognition

A Maximum Likelihood Approach to Deep Neural Network Based Speech Dereverberation