Correspondence: fzheng@tsinghua.edu.cn. Center for Speech and Language Technologies, Tsinghua University, ROOM 4-416, Information Sci & Tech Building, Tsinghua University, 100084 Beijing, China. Full list of author information is available at the end of the article.

Abstract: One of the state-of-the-art approaches to speaker recognition is based on factor analysis, especially the i-vector model. By representing a speech segment as a vector in a low-dimensional vector space, the i-vector model can deal with the complex correlation among components of the Gaussian mixture model (GMM). On the other hand, it is well known that i-vectors contain both speaker and session variances, and therefore additional discriminative approaches are required to emphasize the speaker-dependent information in the ‘total variance’ space. Among various methods, probabilistic linear discriminant analysis (PLDA) achieves significant performance, partly due to its generative model framework, which represents the speaker and session variances in a hierarchical way. A disadvantage of PLDA, however, lies in its Gaussian assumptions on the speaker and session variables, which do not necessarily hold in most situations. This paper presents a discriminative scoring approach for i-vector-based speaker recognition based on deep neural networks (DNN). This approach casts the recognition task as a binary classification problem and employs the DNN model to learn the complex decision boundary in the heterogeneous speaker space. Compared with the PLDA-based approach, the new approach does not rely on any artificial assumption about the distribution of the data and can optimize the model with respect to the recognition task directly. Our experiments on the NIST SRE08 core test demonstrate that the DNN-based approach outperforms the PLDA-based approach, and that combining the DNN and PLDA scores leads to further gains. Finally, we compare the DNN model with a discriminative but shallow model, the support vector machine (SVM), and find that the DNN clearly outperforms the SVM, confirming the advantage of deep learning.
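
To make the binary-classification framing concrete, the following is a minimal sketch of how labelled i-vectors could be turned into verification trials. It is illustrative only: the abstract does not state how the DNN input is constructed, so the concatenated (enrollment, test) pair used here is an assumption, and `make_trials`, `toy_ivectors`, and `toy_speakers` are hypothetical names.

```python
from itertools import combinations


def make_trials(ivectors, speaker_ids):
    """Build binary-classification trials from labelled i-vectors.

    Each trial pairs two i-vectors. The label is 1 when both come from
    the same speaker (target trial) and 0 otherwise (impostor trial).
    Concatenating the pair as the input feature is an assumption made
    for illustration, not the paper's stated design.
    """
    features, labels = [], []
    for i, j in combinations(range(len(ivectors)), 2):
        features.append(list(ivectors[i]) + list(ivectors[j]))
        labels.append(1 if speaker_ids[i] == speaker_ids[j] else 0)
    return features, labels


# Toy data: four 3-dimensional "i-vectors" from two hypothetical speakers.
toy_ivectors = [
    [0.1, 0.2, 0.3],
    [0.0, 0.3, 0.2],
    [-0.4, 0.1, 0.5],
    [-0.5, 0.0, 0.6],
]
toy_speakers = ["spk_a", "spk_a", "spk_b", "spk_b"]

X, y = make_trials(toy_ivectors, toy_speakers)
```

A classifier trained on such `(X, y)` pairs outputs a same-speaker score directly, which is what lets the model be optimized for the recognition task itself instead of relying on a Gaussian generative assumption.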

DNN-based Voice Activity Detection for Speaker Recognition

A Novel and Efficient Voice Activity Detector Using Shape Features of Speech Wave.

Applying Support Vector Machines to Voice Activity Detection

Voice Activity Detection Based on Time-Delay Neural Networks

Denoising Deep Neural Networks Based Voice Activity Detection

A Universal VAD Based on Jointly Trained Deep Neural Networks.

Speech recognition method based on DNN-LSTM combined with Wiener filtering algorithm

Deep Belief Networks Based Voice Activity Detection

An energy-efficient voice activity detector using reconfigurable Gaussian base normalization deep neural network

sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks

Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification

Wavoice: A mmWave-assisted Noise-resistant Speech Recognition System

Personal VAD: Speaker-Conditioned Voice Activity Detection

DNN-based Discriminative Scoring for Speaker Recognition Based on i-vector

Deep Neural Network-Based Bottleneck Feature and Denoising Autoencoder-Based Dereverberation for Distant-Talking Speaker Identification.

DNN based Speaker Recognition on Short Utterances

Efficient voice activity detection algorithm based on sub-band temporal envelope and sub-band long-term signal variability

A Unified DNN Approach to Speaker-Dependent Simultaneous Speech Enhancement and Speech Separation in Low SNR Environments