Abstract: Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers situated in challenging noisy environments and recorded using a 6-channel tablet-based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques, and we perform a number of new experiments to separately assess the impact of different noise environments, different numbers and positions of microphones, or simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.

A Two-stage Single-channel Speaker-dependent Speech Separation Approach for Chime-5 Challenge

A Speaker-Dependent Approach to Separation of Far-Field Multi-Talker Microphone Array Speech for Front-End Processing in the CHiME-5 Challenge

Speaker and Direction Inferred Dual-channel Speech Separation

A Space-and-Speaker-Aware Iterative Mask Estimation Approach to Multi-Channel Speech Recognition in the CHiME-6 Challenge

Royalflush Speaker Diarization System for ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge

A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition

A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Channel selection using neural network posterior probability for speech recognition with distributed microphone arrays in everyday environments

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System

On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones

Single-Channel Multi-Speaker Separation using Deep Clustering

The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

Acoustic Model Ensembling Using Effective Data Augmentation for CHiME-5 Challenge

An analysis of environment, microphone and data simulation mismatches in robust speech recognition

Cracking the cocktail party problem by multi-beam deep attractor network

Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR

Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020

Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator

The xmuspeech system for multi-channel multi-party meeting transcription challenge