Abstract: In this paper, we address the problem of multichannel speech enhancement in the short-time Fourier transform (STFT) domain. A long short-term memory (LSTM) network takes as input a sequence of STFT coefficients associated with a single frequency bin of multichannel noisy-speech signals. The network's output is the corresponding sequence of single-channel cleaned speech. We propose several clean-speech network targets, namely the magnitude ratio mask, the complex STFT coefficients, and the (smoothed) spatial filter. A prominent feature of the proposed model is that the same LSTM architecture, with identical parameters, is trained across frequency bins. The proposed method is referred to as narrow-band deep filtering. This choice stands in contrast to traditional wide-band speech enhancement methods. The proposed deep filtering is able to discriminate between speech and noise by exploiting their different temporal and spatial characteristics: speech is non-stationary and spatially coherent, while noise is relatively stationary and weakly correlated across channels. This is similar in spirit to unsupervised techniques such as spectral subtraction and beamforming. We describe extensive experiments with both mixed signals (noise added to clean speech) and real signals (live recordings). We empirically evaluate the proposed architecture variants using speech enhancement and speech recognition metrics, and we compare our results with those obtained with several state-of-the-art methods. In light of these experiments, we conclude that narrow-band deep filtering has very good speech enhancement and speech recognition performance, and excellent generalization capabilities in terms of speaker variability and noise type.
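To make the narrow-band data layout concrete, the following is a minimal sketch, not the authors' implementation. It assumes the multichannel STFT is stored as nested lists indexed [channel][frequency][frame], that each complex coefficient is fed to the network as a (real, imaginary) pair, and that a magnitude ratio mask is applied to a reference channel. The function names `to_narrow_band_sequences` and `apply_magnitude_ratio_mask` are hypothetical, and the LSTM itself is omitted.

```python
from typing import List

# Assumed layouts (illustrative only):
#   multichannel STFT: [channel][frequency][frame] -> complex
#   narrow-band input: [frequency][frame][feature] -> float


def to_narrow_band_sequences(
    stft: List[List[List[complex]]],
) -> List[List[List[float]]]:
    """Turn a multichannel STFT into one input sequence per frequency bin.

    Each frame's feature vector concatenates the real and imaginary parts
    of all channels, so every frequency bin becomes an independent sequence
    that the same (shared-parameter) network could process.
    """
    n_channels = len(stft)
    n_freqs = len(stft[0])
    n_frames = len(stft[0][0])
    sequences = []
    for f in range(n_freqs):
        sequence = []
        for t in range(n_frames):
            features = []
            for c in range(n_channels):
                coeff = stft[c][f][t]
                features.extend([coeff.real, coeff.imag])
            sequence.append(features)
        sequences.append(sequence)
    return sequences


def apply_magnitude_ratio_mask(
    reference: List[List[complex]],
    mask: List[List[float]],
) -> List[List[complex]]:
    """Scale each reference-channel coefficient by a real-valued mask.

    The magnitude is scaled and the noisy phase is left unchanged.
    Both arguments are indexed [frequency][frame].
    """
    return [
        [m * x for x, m in zip(ref_row, mask_row)]
        for ref_row, mask_row in zip(reference, mask)
    ]


# Toy data: 2 channels, 3 frequency bins, 4 frames.
stft = [
    [[complex(c + 1, f + t) for t in range(4)] for f in range(3)]
    for c in range(2)
]
sequences = to_narrow_band_sequences(stft)

# Placeholder mask; in practice it would be predicted per frequency bin.
mask = [[0.5] * 4 for _ in range(3)]
enhanced = apply_magnitude_ratio_mask(stft[0], mask)
```

Here `sequences` holds three sequences (one per frequency bin), each with four frames of four features (two channels, real and imaginary parts). Because every bin is handled as a separate sequence, a single set of network parameters can be shared across all frequencies, which is the property the abstract emphasizes.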

An enhanced RASTA filtering of speech

An Enhanced RASTA Processing for Speaker Identification

Design and implementation of a speaker recognition system

A Novel and Efficient Voice Activity Detector Using Shape Features of Speech Wave.

A Speech Enhancement Algorithm Based on Computational Auditory Scene Analysis

CHANNEL COMPENSATION OF TELEPHONE SPEECH RECOGNITION BASED ON MFCCs FILTERING

Multi-resolution Time Frequency Feature and Complementary Combination for Short Utterance Speaker Recognition

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

Lite-RTSE: Exploring a Cost-Effective Lite DNN Model for Real-Time Speech Enhancement in RTC Scenarios

ASTT: acoustic spatial-temporal transformer for short utterance speaker recognition

The predictive differential amplitude spectrum for robust speaker recognition in stationary noises

Robust Speech Recognition by Selecting Mel-Filter Banks

Narrow-band Deep Filtering for Multichannel Speech Enhancement

Robust Log-Energy Estimation and Its Dynamic Change Enhancement for In-car Speech Recognition

Speaker recognition with two-step multi-modal deep cleansing

Noise Robust Speaker Recognition Based on Adaptive Frame Weighting in GMM for i-Vector Extraction.

Variant Time-Frequency Cepstral Features for Speaker Recognition

High Performance Digit Mandarin Speech Recognition

An Improved Speech Enhancement Algorithm Based on Wavelet Transform

Robust speech recognition in noisy backgrounds based on Teager energy operator and auditory process

Enhancement of Non-air Conduct Speech Based on Multi-band Spectral Subtraction Method