Abstract: We propose a novel iterative mask estimation (IME) framework that improves the state-of-the-art complex Gaussian mixture model (CGMM)-based beamforming approach by leveraging the complementary information obtained from different deep models. Although CGMM has recently been shown to be quite effective for multi-channel automatic speech recognition (ASR) in operational scenarios, the corresponding mask estimation is not always accurate in adverse environments because it lacks prior and context information. To address this problem, a neural-network-based ideal ratio mask estimator learned from a multi-condition data set is first adopted to incorporate prior information, obtained from the speech/noise interactions and the long acoustic context, into CGMM-based beamformed speech, which has a higher signal-to-noise ratio (SNR) than the original noisy speech signal. Next, to further exploit the rich context information in deep acoustic and language models, voice activity detection information obtained from speech recognition results is used to refine the mask estimation, yielding a significant reduction in insertion errors. In tests on the recently launched CHiME-4 Challenge ASR task of recognizing 6-channel microphone array speech, the proposed IME approach significantly and consistently outperforms the CGMM approach under different configurations, with relative word error rate reductions ranging from 20% to 30%. Furthermore, the IME approach plays a key role in the ensemble system that achieves the best performance in the CHiME-4 Challenge.
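
The mask refinement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the convex-combination fusion rule, the `weight` parameter, and the function names `combine_masks`, `apply_vad`, and `refine_mask` are all assumptions introduced here for illustration, since the abstract does not state how the CGMM mask, the neural-network mask, and the voice activity decisions are combined. Masks are represented as plain nested lists indexed by frame and frequency bin.

```python
from typing import List

# A time-frequency speech mask: mask[frame][frequency_bin], values in [0, 1].
Mask = List[List[float]]


def combine_masks(cgmm_mask: Mask, nn_mask: Mask, weight: float = 0.5) -> Mask:
    """Fuse two speech masks with a convex combination.

    Assumption: the fusion rule and `weight` are hypothetical; the abstract
    does not specify how the two estimates are merged.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(cgmm_mask) != len(nn_mask):
        raise ValueError("masks must have the same number of frames")
    return [
        [weight * c + (1.0 - weight) * n for c, n in zip(c_row, n_row)]
        for c_row, n_row in zip(cgmm_mask, nn_mask)
    ]


def apply_vad(mask: Mask, speech_frames: List[bool]) -> Mask:
    """Zero the speech mask in frames flagged as non-speech.

    Assumption: `speech_frames` stands in for frame-level voice activity
    decisions derived from recognition results.
    """
    if len(mask) != len(speech_frames):
        raise ValueError("need one voice activity decision per frame")
    return [
        list(row) if is_speech else [0.0] * len(row)
        for row, is_speech in zip(mask, speech_frames)
    ]


def refine_mask(cgmm_mask: Mask, nn_mask: Mask,
                speech_frames: List[bool], weight: float = 0.5) -> Mask:
    """One refinement pass: fuse the two masks, then gate with VAD."""
    return apply_vad(combine_masks(cgmm_mask, nn_mask, weight), speech_frames)
```

In an iterative scheme of the kind the abstract outlines, the refined mask would feed the beamformer, and the beamformed output would in turn be used to re-estimate the neural-network mask and the voice activity decisions in the next pass. Suppressing the speech mask in non-speech frames is one plausible way such gating could reduce insertion errors.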

LSTM-Based Iterative Mask Estimation and Post-Processing for Multi-Channel Speech Enhancement

An Iterative Mask Estimation Approach to Deep Learning Based Multi-Channel Speech Recognition

A Space-and-Speaker-Aware Iterative Mask Estimation Approach to Multi-Channel Speech Recognition in the CHiME-6 Challenge

Attention Bidirectional LSTM Networks Based Mime Speech Recognition Using sEMG Data

A Speech Enhancement Neural Network Architecture with SNR-Progressive Multi-Target Learning for Robust Speech Recognition

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones

Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks

Multichannel Speech Enhancement Based on Time-Frequency Masking Using Subband Long Short-Term Memory

2D-to-2D Mask Estimation for Speech Enhancement Based on Fully Convolutional Neural Network

Masks Fusion with Multi-Target Learning For Speech Enhancement

Multiple-target Deep Learning for LSTM-RNN Based Speech Enhancement

Combining Spatial Clustering with LSTM Speech Models for Multichannel Speech Enhancement

Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

Multi-Objective Learning and Mask-Based Post-Processing for Deep Neural Network Based Speech Enhancement

A Maximum Likelihood Approach to SNR-Progressive Learning Using Generalized Gaussian Distribution for LSTM-Based Speech Enhancement

An Iterative Post-processing Approach for Speech Enhancement

PM-MMUT: Boosted Phone-Mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition

Reference Channel Selection by Multi-Channel Masking for End-to-End Multi-Channel Speech Enhancement

Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline

The THU-SPMI CHiME-4 system: Lightweight design with advanced multi-channel processing, feature enhancement, and language modeling