Abstract: Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach that fully incorporates visual information into all system components. The efficacy of the video input is consistently demonstrated in the mask-based MVDR speech separation front-end, the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end, and the Conformer ASR back-end. Audio-visual integrated front-end architectures that perform speech separation and dereverberation either in a pipelined fashion or jointly via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and the ASR back-end is minimized by joint end-to-end fine-tuning using either the ASR cost function alone or its interpolation with the speech enhancement loss. Experiments were conducted on overlapped and reverberant speech mixtures constructed by simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline, with word error rate (WER) reductions of 9.1% and 6.2% absolute (41.7% and 36.0% relative). Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
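To make the mask-based MVDR step mentioned in the abstract more concrete, the following is a minimal NumPy sketch of a mask-driven MVDR beamformer in its commonly used reference-channel form. It is an illustration under stated assumptions, not the authors' implementation: the function name `mask_based_mvdr`, the array shapes, and the use of real-valued time-frequency masks are all hypothetical choices. In the paper's setting the masks would come from an audio-visual network; here they are simply passed in.

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask, ref_channel=0, eps=1e-8):
    """Hypothetical sketch of mask-based MVDR beamforming.

    Y           : complex multi-channel STFT, shape (C, F, T)
    speech_mask : real-valued mask for the target speaker, shape (F, T)
    noise_mask  : real-valued mask for interference plus noise, shape (F, T)
    Returns the beamformed single-channel STFT, shape (F, T).
    """
    C = Y.shape[0]

    def psd(mask):
        # Mask-weighted spatial covariance matrix per frequency, shape (F, C, C)
        num = np.einsum('ft,cft,dft->fcd', mask, Y, Y.conj())
        return num / (mask.sum(axis=1)[:, None, None] + eps)

    phi_s = psd(speech_mask)
    # Small diagonal loading keeps the noise covariance invertible
    phi_n = psd(noise_mask) + eps * np.eye(C)[None]

    # w(f) = (Phi_n^{-1} Phi_s) u / trace(Phi_n^{-1} Phi_s), u selects ref_channel
    numerator = np.linalg.solve(phi_n, phi_s)               # (F, C, C)
    trace = np.trace(numerator, axis1=1, axis2=2)[:, None]  # (F, 1)
    w = numerator[:, :, ref_channel] / (trace + eps)        # (F, C)

    # Apply the filter: X_hat(f, t) = w(f)^H Y(f, t)
    return np.einsum('fc,cft->ft', w.conj(), Y)
```

When the speech covariance is rank one, this filter passes the target component at the reference channel undistorted while minimizing the output noise power, which is why the quality of the estimated masks matters so much for the downstream ASR back-end.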

Automatic multi-speaker speech recognition system based on time-frequency blind source separation under ubiquitous environment

Design and implementation of a speaker recognition system

Glottal Information Based Spectral Recuperation in Multi-channel Speaker Recognition

Experiments on Blind Speech Separations

CASA Based Speech Separation for Robust Speech Recognition

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Robust Front-End for Speech Recognition Based on Computational Auditory Scene Analysis and Speaker Model

Speaker Recognition System in Multi-Channel Environment

A Real-time Speaker Diarization System Based on Spatial Spectrum

Wavoice: A mmWave-assisted Noise-resistant Speech Recognition System

Real-time Architecture for Audio-Visual Active Speaker Detection

Audio-visual multi-channel speech separation, dereverberation and recognition

DualSep: A Light-weight dual-encoder convolutional recurrent network for real-time in-car speech separation

Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis

Speech Selection and Environmental Adaptation for Asynchronous Speech Recognition

Low-SNR Speech Enhancement and Separation in Driving Environment

An automatic mixing speech enhancement system for multi-track audio

Separating Voices from Multiple Sound Sources Using 2D Microphone Array