Abstract: We propose a novel iterative mask estimation (IME) framework that improves the state-of-the-art complex Gaussian mixture model (CGMM)-based beamforming approach by leveraging the complementary information obtained from different deep models. Although CGMM has recently been demonstrated to be quite effective for multi-channel automatic speech recognition (ASR) in operational scenarios, the corresponding mask estimation is not always accurate in adverse environments due to the lack of prior or context information. To address this problem, a neural-network-based ideal ratio mask estimator learned from a multi-condition data set is first adopted to incorporate prior information, obtained from the speech/noise interactions and the long acoustic context, into the CGMM-based beamformed speech, which has a higher signal-to-noise ratio (SNR) than the original noisy speech signal. Next, to further utilize the rich context information in deep acoustic and language models, voice activity detection information obtained from speech recognition results is used to refine the mask estimation, yielding a significant reduction in insertion errors. In tests on the recently launched CHiME-4 Challenge ASR task of recognizing 6-channel microphone array speech, the proposed IME approach significantly and consistently outperforms the CGMM approach under different configurations, with relative word error rate reductions ranging from 20% to 30%. Furthermore, the IME approach plays a key role in the ensemble system that achieves the best performance in the CHiME-4 Challenge.
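To make the mask refinement step more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the voice activity detection decisions are available as one boolean per frame and that they are applied by zeroing the speech mask in non-speech frames; the abstract does not specify the exact rule. The function names `refine_mask_with_vad` and `masked_covariance` are hypothetical. The second function shows the mask-weighted spatial covariance that mask-based beamformers commonly build from such a mask.

```python
import numpy as np

def refine_mask_with_vad(mask, vad):
    """mask: (T, F) speech mask in [0, 1]; vad: (T,) booleans, True = speech frame.

    Assumption for illustration: frames marked as non-speech get a speech mask of zero.
    """
    return mask * vad[:, None].astype(mask.dtype)

def masked_covariance(stft, mask, eps=1e-10):
    """stft: (C, T, F) complex multi-channel STFT; mask: (T, F) real mask.

    Returns (F, C, C) mask-weighted spatial covariance matrices.
    """
    # For each frequency f: sum over frames t of mask[t, f] * y[t, f] y[t, f]^H.
    num = np.einsum("tf,ctf,dtf->fcd", mask, stft, stft.conj())
    # Normalize by the total mask weight per frequency (eps avoids division by zero).
    den = mask.sum(axis=0)[:, None, None] + eps
    return num / den
```

A beamformer would then be derived from the speech and noise covariance matrices, for example using the refined mask for speech and its complement for noise; that step is omitted here.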

An information fusion framework with multi-channel feature concatenation and multi-perspective system combination for the deep-learning-based robust recognition of microphone array speech

An information fusion approach to recognizing microphone array speech in the CHiME-3 challenge based on a deep learning framework

On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Sentiment Analysis Using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Combining Information from Multi-Stream Features Using Deep Neural Network in Speech Recognition

A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model

Robust speech recognition using beamforming with adaptive microphone gains and multichannel noise reduction

A Speaker-Dependent Approach to Separation of Far-Field Multi-Talker Microphone Array Speech for Front-End Processing in the CHiME-5 Challenge

An Iterative Mask Estimation Approach to Deep Learning Based Multi-Channel Speech Recognition

Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies

A Fusion Approach to Spoken Language Identification Based on Combining Multiple Phone Recognizers and Speech Attribute Detectors

Fusion of deep shallow features and models for speaker recognition

Mmmic: Multi-modal Speech Recognition Based on Mmwave Radar

Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions

Attentive Fusion Enhanced Audio-Visual Encoding for Transformer Based Robust Speech Recognition

Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion