Abstract: The performance of automatic speech recognition (ASR) systems degrades substantially on overlapped speech, because these systems are designed for single-talker speech. To enhance ASR performance in conversational or meeting environments, continuous speaker separation (CSS) is commonly employed. However, CSS requires a short separation window, so that few speakers fall inside any one window, and it requires sequential grouping of discontinuous speech segments. To address these limitations, we introduce a new multi-channel framework called "speaker separation via neural diarization" (SSND) for meeting environments. Our approach uses an end-to-end diarization system to identify the speech activity of each individual speaker. From the estimated speaker boundaries, we generate a sequence of embeddings, which in turn guide the assignment of speakers to the outputs of a multi-talker separation model. SSND resolves the permutation ambiguity of talker-independent speaker separation during the diarization phase, through location-based training, rather than during the separation process. This allows multiple non-overlapped speakers to be assigned to the same output stream, making it possible to process long segments efficiently, a task impossible with CSS. Additionally, SSND is naturally suited to speaker-attributed ASR. We evaluate our proposed diarization and separation methods on the open LibriCSS dataset, advancing state-of-the-art diarization and ASR results by a large margin.

Multi-channel Conversational Speaker Separation via Neural Diarization

Channel-Combination Algorithms for Robust Distant Voice Activity and Overlapped Speech Detection

Integrating Source-Channel and Attention-Based Sequence-to-Sequence Models for Speech Recognition