Abstract: Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers situated in challenging noisy environments and recorded using a 6-channel tablet-based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques, and we perform a number of new experiments to separately assess the impact of different noise environments, different numbers and positions of microphones, and simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects the ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.

A Principle Solution for Enroll-Test Mismatch in Speaker Recognition

FenceSitter: Black-box, Content-Agnostic, and Synchronization-Free Enrollment-Phase Attacks on Speaker Recognition Systems

Squeezing value of cross-domain labels: a decoupled scoring approach for speaker verification

On The Use Of Statistical Ensemble Methods For Telephone-Line Speaker Identification

Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

A Cohort-Based Speaker Model Synthesis for Mismatched Channels in Speaker Verification

A Simulation Study on Optimal Scores for Speaker Recognition

A speaker verification backend with robust performance across conditions

Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models

Multiobjective Optimization Training of PLDA for Speaker Verification

PCAD: Towards ASR-Robust Spoken Language Understanding via Prototype Calibration and Asymmetric Decoupling

Recursive Whitening Transformation for Speaker Recognition on Language Mismatched Condition

Learning from human perception to improve automatic speaker verification in style-mismatched conditions

An analysis of environment, microphone and data simulation mismatches in robust speech recognition

A study of the robustness of raw waveform based speaker embeddings under mismatched conditions

TRSD: A Time-Varying and Region-Changed Speech Database for Speaker Recognition

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Mismatched Feature Detection with Finer Granularity for Emotional Speaker Recognition

A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition

Score domain speaking rate normalization for speaker recognition

SDBM-based speaker recognition for speaking style variations