Abstract: We propose a novel speaker-dependent speech separation framework for the challenging CHiME-5 acoustic environments. It exploits the advantages of both deep learning based and conventional preprocessing techniques to prepare data effectively for separating target speech from multi-talker mixed speech collected with multiple microphone arrays. First, a series of multi-channel operations is conducted to reduce reverberation and noise, and a single-channel deep learning based speech enhancement model is used to predict speech presence probabilities. Next, a two-stage supervised speech separation approach, using oracle speaker diarization information from CHiME-5, is proposed to separate the speech of a target speaker from that of interfering speakers in mixed speech. Given a set of three estimated masks (for the background noise, the target speaker, and the interfering speakers) from the single-channel speech enhancement and separation models, a complex Gaussian mixture model based generalized eigenvalue beamformer is then used to enhance the signal at the reference array while avoiding the speaker permutation issue. Furthermore, the proposed front-end can generate a large variety of processed data for an ensemble of speech recognition results. Experiments on the development set show that the proposed two-stage approach yields significant improvements in recognition performance over the official baseline system, and it achieved top accuracies in all four competing evaluation categories among all systems submitted to the CHiME-5 Challenge.
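
The mask-driven beamforming step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it skips the complex Gaussian mixture model refinement and uses the masks directly to estimate spatial covariance matrices, then takes the principal generalized eigenvector per frequency bin as the beamformer weights. The function names, array layouts, and the diagonal loading constant `eps` are assumptions made for illustration.

```python
import numpy as np


def spatial_covariance(stft, mask):
    """Mask-weighted spatial covariance matrix for every frequency bin.

    stft: complex array, shape (channels, frames, freqs)
    mask: real array in [0, 1], shape (frames, freqs)
    returns: complex array, shape (freqs, channels, channels)
    """
    weighted = stft * mask[None, :, :]
    # Sum over frames of mask(t, f) * y(t, f) y(t, f)^H.
    cov = np.einsum("ctf,dtf->fcd", weighted, stft.conj())
    norm = np.maximum(mask.sum(axis=0), 1e-10)
    return cov / norm[:, None, None]


def gev_weights(cov_target, cov_other, eps=1e-6):
    """Principal generalized eigenvector of (cov_target, cov_other) per bin.

    eps is an assumed diagonal loading factor that keeps cov_other invertible.
    returns: complex array, shape (freqs, channels)
    """
    n_freqs, n_ch, _ = cov_target.shape
    weights = np.empty((n_freqs, n_ch), dtype=complex)
    for f in range(n_freqs):
        load = eps * np.trace(cov_other[f]).real / n_ch + 1e-12
        reg = cov_other[f] + load * np.eye(n_ch)
        # Eigenvectors of reg^{-1} cov_target solve the generalized problem.
        vals, vecs = np.linalg.eig(np.linalg.solve(reg, cov_target[f]))
        w = vecs[:, np.argmax(vals.real)]
        weights[f] = w / np.linalg.norm(w)
    return weights


def apply_beamformer(stft, weights):
    """Compute w(f)^H y(t, f); returns shape (frames, freqs)."""
    return np.einsum("fc,ctf->tf", weights.conj(), stft)
```

In this sketch the target-speaker mask would feed `cov_target`, while the noise and interfering-speaker masks could be summed (and clipped to 1) to feed `cov_other`. Because the target mask is tied to a specific speaker, the output channel is unambiguous, which is one way to read the abstract's remark about avoiding the speaker permutation issue.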

Multi-speaker Segmentation and Clustering of Telephone Speech

Speaker Segmentation and Clustering Based on the Improved Spectral Clustering

UBM Based Speaker Segmentation and Clustering for 2-Speaker Detection

A new DP-like speaker clustering algorithm

Single-Channel Multi-Speaker Separation using Deep Clustering

An Improved Speaker Based Speech Segmentation Algorithm

Speaker clustering method for distributed microphone

Efficient Audio Stream Segmentation Via the Combined T-2 Statistic and Bayesian Information Criterion

A Quick and Effective Speaker Diarization System

Speaker Adaptation for Telephony Data Using Speaker Clustering

A pitch-based rapid speech segmentation for speaker indexing

VB-HMM Speaker Diarization with Enhanced and Refined Segment Representation

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

A New Robust Telephone Speech Recognition Algorithm With The Multi-Model Structures

Speaker Clustering Algorithm in Speech Recognition

Using confidence measures to evaluate the speaker turns in speaker segmentation

Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens

Multi-feature Combination for Speaker Recognition

A Speaker-Dependent Approach to Separation of Far-Field Multi-Talker Microphone Array Speech for Front-End Processing in the CHiME-5 Challenge

Speaker Recognition Using DMFCC over Telephone Channels

Speaker Segmentation Based on Between-Window Correlation over Speakers' Characteristics