Abstract: Previously, a dereverberation method based on generalized spectral subtraction (GSS) using multi-channel least mean-squares (MCLMS) was proposed. Speech recognition experiments showed that this method achieved a significant improvement over conventional methods. In this paper, we apply the method to distant-talking (far-field) speaker recognition. For far-field speech, however, GSS-based dereverberation used with clean speech models degrades speaker recognition performance. This may be because GSS-based dereverberation causes some distortion between clean speech and dereverberant speech. We address this problem by training speaker models on dereverberant speech obtained by suppressing reverberation from arbitrary artificial reverberant speech. Furthermore, we propose an efficient method for computing a combination of the likelihoods of dereverberant speech obtained with multiple compensation parameter sets, which addresses the problem of determining optimal compensation parameters for GSS. We report the results of a speaker recognition experiment performed on large-scale far-field speech in reverberant environments that differ from the training environments. The proposed GSS-based dereverberation method achieves a recognition rate of 92.2%, which compares well with conventional cepstral mean normalization with delay-and-sum beamforming using a clean speech model (49.0%) and a reverberant speech model (88.4%). We also compare the proposed method with another dereverberation technique, multi-step linear prediction-based spectral subtraction (MSLP-GSS), which achieves a recognition rate of 90.6%; the proposed method outperforms it. The use of multiple compensation parameters further improves the recognition performance, giving our approach a recognition rate of 93.6%. We implement this method in a real environment using the optimal compensation parameters estimated from an artificial environment. The results show a recognition rate of 87.8%, compared with 72.5% for delay-and-sum beamforming using a reverberant speech model.
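
As a rough illustration of the subtraction step that GSS refers to, the following minimal sketch applies the textbook form of generalized spectral subtraction to a single spectral frame. It is not the authors' implementation: the function name, the parameter names `alpha` (subtraction weight), `beta` (spectral floor), and `n` (exponent), and their default values are assumptions made for this example, and the late-reverberation magnitude is simply passed in rather than estimated with MCLMS.

```python
def generalized_spectral_subtraction(observed_mag, late_reverb_mag,
                                     alpha=1.0, beta=0.1, n=0.5):
    """Subtract an estimated late-reverberation spectrum from one frame.

    observed_mag    -- magnitude spectrum |X| of the reverberant frame
    late_reverb_mag -- estimated late-reverberation magnitude |R| per bin
                       (assumed to be given; its estimation is not shown)
    alpha, beta, n  -- hypothetical names for the subtraction weight,
                       the spectral floor, and the exponent
    """
    if len(observed_mag) != len(late_reverb_mag):
        raise ValueError("spectra must have the same number of bins")

    power = 2.0 * n  # n = 1 gives power subtraction, n = 0.5 magnitude subtraction
    cleaned = []
    for x, r in zip(observed_mag, late_reverb_mag):
        x_pow = x ** power
        # Subtract the weighted reverberation estimate in the |.|^(2n) domain.
        diff = x_pow - alpha * (r ** power)
        # Floor the result so it never drops below a fraction of the input.
        floored = max(diff, beta * x_pow)
        # Map back from the |.|^(2n) domain to a magnitude.
        cleaned.append(floored ** (1.0 / power))
    return cleaned


if __name__ == "__main__":
    frame = [2.0, 1.0]
    reverb = [1.0, 1.0]
    print(generalized_spectral_subtraction(frame, reverb, n=0.5))
```

In this sketch the exponent and the subtraction weight play the role of tunable settings; whether they correspond exactly to the compensation parameters mentioned in the abstract is an assumption.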

Distant-talking accent recognition by combining GMM and DNN

Deep Neural Network-Based Bottleneck Feature and Denoising Autoencoder-Based Dereverberation for Distant-Talking Speaker Identification

Multilingual Approach to Joint Speech and Accent Recognition with DNN-HMM Framework

Distant-talking Speaker Identification by Generalized Spectral Subtraction-Based Dereverberation and Its Efficient Computation

Dereverberation Based on Generalized Spectral Subtraction for Distant-Talking Speaker Recognition

Joint Training of DNNs by Incorporating an Explicit Dereverberation Structure for Distant Speech Recognition

Phonotactic language recognition based on DNN-HMM acoustic model

Reliable Accent-Specific Unit Generation With Discriminative Dynamic Gaussian Mixture Selection for Multi-Accent Chinese Speech Recognition

Reliable accent specific unit generation with dynamic Gaussian mixture selection for multi-accent speech recognition

Improved Accent Classification Combining Phonetic Vowels with Acoustic Features

Acceleration Strategies for Speech Recognition Based on Deep Neural Networks

Investigation of deep neural networks (DNN) for large vocabulary continuous speech recognition: Why DNN surpasses GMMs in acoustic modeling

A robust accent classification system based on variational mode decomposition

Discriminative Dynamic Gaussian Mixture Selection with Enhanced Robustness and Performance for Multi-Accent Speech Recognition

Deep Discriminative Feature Learning for Accent Recognition

Investigation of Deep Neural Network Acoustic Modelling Approaches for Low Resource Accented Mandarin Speech Recognition

Leveraging Native Language Speech for Accent Identification using Deep Siamese Networks

Hybrid Deep Neural Network-Hidden Markov Model (DNN-HMM) Based Speech Emotion Recognition

Improving BLSTM RNN Based Mandarin Speech Recognition Using Accent Dependent Bottleneck Features

DNN-based Voice Activity Detection for Speaker Recognition